Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New Histogram field mapper that supports percentiles aggregations. #48580

Merged
merged 30 commits into from
Nov 28, 2019

Conversation

iverase
Copy link
Contributor

@iverase iverase commented Oct 28, 2019

This PR explores the addition of a new histogram field mapper that consist in a pre-aggregated format of numerical data to be used in percentiles aggregations.

Mapper

The new field is defined in a mapping using the following structure:

PUT /example
{
    "mappings": {
        "properties": {
            "aggregated": {
                "type": "histogram"
            }
        }
    }
}

And it can be populated using the following structure:

POST /example/_doc
{
    "aggregated" : {
        "values" :[0.1, 0.2, 0.3, 0.4, 0.5],
        "counts" : [5, 3, 14, 6, 4]
    }
}

where values is an array of doubles and counts is an array of integers and must have the same length. This format is up for discussion but the reasons to choose this format is that they can be easily generated from the existing histograms.

For TDigest, they can be generated using the following code (see example):

                List<Double> values = new ArrayList<>();
                List<Integer> counts = new ArrayList<>();
                Collection<Centroid> centroids = histogram.centroids();
                for (Centroid centroid : centroids) {
                    values.add(centroid.mean());
                    counts.add(centroid.count());
                }

For HDR histograms, hey can be generated using the following code (see example):

                List<Double> values = new ArrayList<>();
                List<Integer> counts = new ArrayList<>();
                Iterator<DoubleHistogramIterationValue> iterator = histogram.recordedValues().iterator();
                while (iterator.hasNext()) {
                    DoubleHistogramIterationValue histValue = iterator.next();
                    values.add(histValue.getValueIteratedTo());
                    counts.add(Math.toIntExact(histValue.getCountAtValueIteratedTo()));
                }

This structure is stored as a binary doc value

Aggregations

This field can be used in standard percentile and percentile_ranks aggregations. They can be used together with standard numeric fields. In order for this aggregations to support this new format the following interfaces has been created:

  • IndexHistogramFieldData: A new specialisation of field data for histograms.
  • AtomicHistogramFieldData: A new specialisation of atomic field data for histograms.
  • HistogramValues: The docVales returned by the atomic field data.
  • HistogramValue: A doc value representing one of those histograms.

The aggregations has been updated accordingly to be able to understand this new field data.

Open questions

  • Accuracy and suitability of the format.

  • The field mapper does not support as input an array of histograms, does it need it to support it?

  • Overflow of aggregations? As we are adding pre-aggregated data, it can theoretically happen that aggregation histogram can more easily overflow (I think the maximum number of total counts is LONG.MAX_VALUE).

Relates #48578

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it a lot. My main concern is that the parsing of aggregations was made lenient for this to work by accepting any field in the parser. I think it's fine that we use generic objects internally like ValuesSource and do instanceof calls to see how these objects should be handled. However I'd like the parser to be strict and fail if a field is provided that is neither a numeric field or a histogram field?

Let's add docs?

Accuracy and suitability of the format.

It looks good to me.

The field mapper does not support as input an array of histograms, does it need it to support it?

I don't think it needs to. We could document that this field only supports single values, like we do for vector fields (you might want to borrow the logic that vectors used to make sure fields are actually single-valued).

Overflow of aggregations? As we are adding pre-aggregated data, it can theoretically happen that aggregation histogram can more easily overflow (I think the maximum number of total counts is LONG.MAX_VALUE).

I think we should ignore this problem on our end and push this problem to the upstream libraries, since this is a general problem that they have given that they allow collecting multiple instances of the same value at once. The fact that you require counts to be integers and reject longs sounds to me like the right thing to do on our end.

return new HDRPercentilesAggregator(name, valuesSource, searchContext, parent, percents, numberOfSignificantValueDigits, keyed,
config.format(), pipelineAggregators, metaData);
config.format(), pipelineAggregators, metaData);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you try to undo formatting changes in this file and a couple other ones to keep the diff readable?

@@ -80,7 +79,7 @@
static {
PARSER = new ConstructingObjectParser<>(PercentileRanksAggregationBuilder.NAME, false,
(a, context) -> new PercentileRanksAggregationBuilder(context, (List) a[0]));
ValuesSourceParserHelper.declareNumericFields(PARSER, true, false, false);
ValuesSourceParserHelper.declareAnyFields(PARSER, true, true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we try to be less lenient and require a field that is either numeric or a histogram?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that makes sense and that is the tricky part. @not-napoleon I might want to borrow a bit of your thoughts on how to do this in the best way possible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parser doesn't currently support that, and right now the convention is to use ANY for aggregations that can accept more than one type of input. Cleaning that up is a goal of the ValuesSource refactor effort I'm currently working on (see #42949)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @not-napoleon, let's keep this TODO for later then @iverase ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I think to be able to support that we require the effort that @not-napoleon is working on (thanks for taking a look!).

return getHistogramValue(values.binaryValue());
} catch (IOException e) {
throw new IllegalStateException("Cannot load doc value", e);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make histogram() throw an IOException for consistency with BinaryDocValues#binaryValue?

} catch (IOException e) {
throw new IllegalStateException("Cannot load doc values", e);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should throw an UnsupportedOperationException in the two above methods.

@Override
public SortField sortField(Object missingValue, MultiValueMode sortMode,
XFieldComparatorSource.Nested nested, boolean reverse) {
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should throw an UnsupportedOperationException too

private HistogramValue getHistogramValue(final BytesRef bytesRef) throws IOException {
final ByteBufferStreamInput streamInput = new ByteBufferStreamInput(
ByteBuffer.wrap(bytesRef.bytes, bytesRef.offset, bytesRef.length));
final int numValues = streamInput.readVInt();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also avoid storing the length and consider the iterator exhausted when all bytes have been read.

context.path().add(simpleName());
try {
List<Double> values = null;
List<Integer> counts = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's use native variants, e.g. IntArrayList and DoubleArrayList?

if (values.size() == 0) {
throw new MapperParsingException("error parsing field ["
+ name() + "], arrays for values and counts cannot be empty");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether we should actually fail it or not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand, maybe we should require that values come in order and fail if there are duplicate values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Respect empty arrays, yes we can just ignore it instead of failing.

I like he idea of requiring values to be ordered and disallow duplicates. Before implementing I will wait for more input.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the code so in case of empty arrays we just ignore it as we do with null values.

In addition we require now that values are provided in increasing order.

@felixbarny
Copy link
Member

Really looking forward to this 😍
The API also looks great to me!

Question: does the values array have to contain the same values for each doc? In other words, does the first doc determine the structure of the histogram?

Is it possible to extend the range in subsequent documents? Example:

POST /example/_doc
{
    "aggregated" : {
        "values" :[0.1, 0.2],
        "counts" : [1, 2]
    }
}

POST /example/_doc
{
    "aggregated" : {
        "values" :[0.1, 0.2, 0.3],
        "counts" : [1, 2, 3]
    }
}

Is it possible to use different buckets in subsequent documents? Example:

POST /example/_doc
{
    "aggregated" : {
        "values" :[0.1, 0.2],
        "counts" : [1, 2]
    }
}

POST /example/_doc
{
    "aggregated" : {
        "values" :[0.15, 0.2, 0.25],
        "counts" : [1, 2, 3]
    }
}

Is it possible to omit buckets with a value of zero? Example:

POST /example/_doc
{
    "aggregated" : {
        "values" :[0.1, 0.2, 0.3],
        "counts" : [1, 2, 3]
    }
}

POST /example/_doc
{
    "aggregated" : {
        "values" :[0.1, 0.3],
        "counts" : [1, 0, 3]
    }
}

These are not necessarily requirements we have in APM, just trying to get a feel for what's possible.

import java.util.List;


public class HistogramAggregationTests extends ESSingleNodeTestCase {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This naming is a bit confusing since there is also a histogram aggregation which is completely different to this new field. Maybe we should call this HistogramPercentileAggregationTests or something else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally created this test to check different approaches. I renamed it following your suggestions.

@iverase
Copy link
Contributor Author

iverase commented Oct 29, 2019

@felixbarny

Is it possible to extend the range in subsequent documents?
Is it possible to use different buckets in subsequent documents?

Yes that is possible as every histogram is independent to the others so it can have different buckets and different number of buckets.

Is it possible to omit buckets with a value of zero?

The count for a bucket can be zero but the bucket must be present. In general the length of the values and counts arrays must be the same. We are still considering some other changes, in particular we might require that buckets are ordered and you cannot have twice the same bucket.

@colings86
Copy link
Contributor

you cannot have twice the same bucket.

One thing to note is that it's possible in TDigest particularly to have multiple buckets with the same value. This can happen if the same value is inserted enough times to exceed the threshold for the centroid which will force a new centroid with the same value to be created. IF we move to reject histograms with multiple of the same bucket, it will make the code for extracting the TDigest histogram (from your PR description) less simple since the user will need to keep track of whether the value has changed when moving to the next centroid.

I don't see a problem with requiring ordered buckets though

@jpountz
Copy link
Contributor

jpountz commented Oct 29, 2019

Good point Colin, I agree that rejecting duplicates would make it tricky to work with t-digests.

@iverase
Copy link
Contributor Author

iverase commented Oct 30, 2019

If we allow duplicates, I don't think we require buckets to be ordered. wdyt?

@jpountz
Copy link
Contributor

jpountz commented Oct 30, 2019

I think we should at least make sure values are sorted in the doc-value representation. Then we could either sort buckets ourselves or require users to provide us with sorted buckets.

@iverase iverase added the v7.6.0 label Nov 27, 2019
Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like the ignoreMalformed logic to be a bit more robust, but other than that it looks good to me.

can be extracted either from specific numeric fields in the documents, or
be generated by a provided script.
over numeric values extracted from the aggregated documents. These values can be
generated by a provided script or extracted from specific numeric or histogram
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a link to the histogram field?

can be extracted either from specific numeric fields in the documents, or
be generated by a provided script.
over numeric values extracted from the aggregated documents. These values can be
generated by a provided script or extracted from specific numeric or histogram
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a link to the histogram field?

Map<String, Object> node, ParserContext parserContext)
throws MapperParsingException {
Builder builder = new HistogramFieldMapper.Builder(name);
parseField(builder, name, node, parserContext);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not call parseField, none of the properties that it parses are supported by this field?

Copy link
Contributor Author

@iverase iverase Nov 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am supporting them:

  • user can disable docValues
  • If user set index, index options or stored values, an error will be thrown. Note setting index or stored values to false is allowed.

I am not handling boost, similarity and copy_to. Shall we throw an error if user defines those fields?

} else if (count > 0) {
// we do not add elements with count == 0
streamOutput.writeDouble(values.get(i));
streamOutput.writeVInt(count);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest putting the count before the values, it might make it easier to better compress in the future by stealing bits of the count.

if (ignoreMalformed.value() == false) {
throw new MapperParsingException("failed to parse field [{}] of type [{}]",
ex, fieldType().name(), fieldType().typeName());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what XContentSubParser has been designed for. See #35603. Maybe it would be more robust? By the way looking at the latest version of GeoShapeFieldMapper, it looks like it no longer handles ignoreMalformed correctly, or am I misreading it cc @imotov ?

Copy link
Contributor

@polyfractal polyfractal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the docs look good. I just left a few minor comments/tweaks to make a certain aspect more explicit, otherwise I think they are fine 👍

Been following the code changes from afar, think Adrien has the code review aspect covered so I'll defer to him there :)

histogram was constructed. It is important to consider the percentiles aggregation mode that will be used
to build it. Some possibilities include:

- For the <<search-aggregations-metrics-percentile-aggregation, T-Digest>> mode, histograms
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, trying to tweak this a little to make it more explicit, so the user knows what the value/count fields do.

  • For the <<search-aggregations-metrics-percentile-aggregation, T-Digest>> mode, the values array represents the mean centroid positions and the counts array represents the number of values that are attributed to each centroid. If the algorithm has already started to approximate the percentiles, this inaccuracy is carried over in the histogram.

WDYT?

can be built by using the mean value of the centroids and the centroid's count. If the algorithm has already
started to approximate the percentiles, this inaccuracy is carried over in the histogram.

- For the <<_hdr_histogram,High Dynamic Range (HDR)>> histogram mode, histograms
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly,

  • For the <<_hdr_histogram,High Dynamic Range (HDR)>> histogram mode, the values array represents fixed upper limits of each bucket interval, and the counts array represents the number of values that are attributed to each interval. This implementation maintains a fixed worse-case percentage error (specified as a number of significant digits), therefore the value used when generating the histogram would be the maximum accuracy you can achieve at aggregation time.

??

can be created by using the recorded values and the count at that value. This implementation maintains a fixed worse-case
percentage error (specified as a number of significant digits), therefore the value used when generating the histogram
would be the maximum accuracy you can achieve at aggregation time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps another sentence/paragraph at the end?

The histogram field is "algorithm agnostic" and does not store data specific to either T-Digest or HDRHistogram. While this means the field can technically be aggregated with either algorithm, in practice the user should chose one algorithm and index data in that manner (e.g. centroids for T-Digest or intervals for HDRHistogram) to ensure best accuracy.

Or something similar... trying to convey to the user that how they index the data is important and they should chose upfront.

@iverase
Copy link
Contributor Author

iverase commented Nov 28, 2019

@jpountz, I changed the logic so now I am using XContentSubParser to handle ignore malformed (as it should be).

My only doubt is about parse fields. I am supporting doc values so a user can disable them. It might be useful if a user wants to store the histogram fields (already parsed) without doc values so at a later stage it can reindex them again with them, wdyt?

should be handle boost, similarity and copy_to and throw an error if the user defines them?

@jpountz
Copy link
Contributor

jpountz commented Nov 28, 2019

Given that it's always easier to add features than remove them, I'd be in favor of only supporting ignore_malformed for now?

@iverase
Copy link
Contributor Author

iverase commented Nov 28, 2019

Great, that makes things simpler. I removed support for parse fields

token = context.parser().nextToken();
if (subParser != null) {
while (token != null) {
token = subParser.nextToken();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do subParser.close() instead?

`histogram` fields are primarily intended for use with aggregations. To make it
more readily accessible for aggregations, `histogram` field data is stored as a
binary <<doc-values,doc values>> and not indexed. Its size in bytes is at most
`12 * numValues`, where `numValues` is the length of the provided arrays.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's actually 13 since vints can take up to 5 bytes.

@iverase
Copy link
Contributor Author

iverase commented Nov 28, 2019

@elasticmachine run elasticsearch-ci/bwc
@elasticmachine run elasticsearch-ci/default-distro

@iverase
Copy link
Contributor Author

iverase commented Nov 28, 2019

@elasticmachine update branch

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.