Rollup API with term on an array field give incorrect aggregation results #45015
Comments
Pinging @elastic/es-analytics-geo |
This is expected, in your example the grouping key is a composite of the |
The use case is that I have a bunch of devices that I am tracking metrics for and need to analyze/graph/etc. There is a desire for an open "tags" section to let users of the system categorize the devices flexibly by putting in a list of arbitrary strings. This affects the graphs to allow one to select a set of tags to filter by, or not. If I leave the tags out of the rollup I cannot filter by them. If I include the tags as an array I get incorrect aggregation results. Since the tags are unstructured, using them as keys would mean constantly modifying the rollup job on the fly to add a big growing list of terms. |
+1, I am facing the same issue, with sum as well as terms aggregation on array fields in a rollup, please open this link for details => https://discuss.elastic.co/t/rollup-terms-aggregation-giving-wrong-result-with-array-fields/220300 |
We at the Guardian are also facing this issue. We've defined a rollup job according to the conditions below, rolling up on 11 fields (truncated for brevity) which all have a single value:
The primary store size of the resulting rollup index for the above job is approximately 13GB per day's worth of rolled up data. However, adding the "pageview.tags" field to the terms we roll up, which consists of an array of values, causes the primary store size of the rollup index to balloon - it's now in the region of 50GB per day's worth of rolled up data. We need to store the tags as an array to enable our users to filter on tags, but since the set of tags for a given article is generally static (give or take a small number of additions/removals once the piece has been published), we don't want or need to roll up each tag individually in a separate document. Is there a way to roll up each distinct set of tags instead? |
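To make the request above concrete, here is a minimal Python sketch of what "rolling up each distinct set of tags" could look like. It is a plain in-memory illustration, not an Elasticsearch feature or API: the function name `rollup_by_tag_set` and the sample documents are hypothetical, and the grouping key is assumed to be the sorted, de-duplicated tag list.

```python
from collections import defaultdict


def rollup_by_tag_set(docs, term_field, metric_field):
    """Sum metric_field per distinct *set* of values in term_field.

    Hypothetical helper: one bucket per distinct tag set, so a document
    contributes to exactly one bucket regardless of how many tags it has.
    """
    buckets = defaultdict(int)
    for doc in docs:
        # Order and duplicates are ignored so ["a", "b"] and ["b", "a"] match.
        key = tuple(sorted(set(doc[term_field])))
        buckets[key] += doc[metric_field]
    return dict(buckets)


docs = [
    {"value": 2, "tags": ["one", "two", "three"]},
    {"value": 5, "tags": ["three", "one", "two"]},
    {"value": 1, "tags": ["one"]},
]

by_set = rollup_by_tag_set(docs, "tags", "value")
total = sum(by_set.values())
print(by_set, total)
```

Under this grouping the number of buckets grows with the number of distinct tag sets rather than with the number of individual tags, and the overall sum still matches the source documents.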
I don't think it's an issue but rather a feature that is missing here. From what I understand none of the use cases described in the issue require to have the
The |
As a workaround for this issue, we've experimented with writing out our array field (named This does work, and is just about queryable by tag if we modify our search request, but it dramatically increases the size of the rollup index- especially because, even if a document only has 2 or 3 tags, all 20 rollup tag fields are represented (mostly nulls) in the resulting document.
A solution properly supported within Elasticsearch would be great for us! Has there been a development that would work for us since #49830 was merged? |
I've spent two days trying to figure out why my aggregations against the original index and rolled up one are not matching. It's not doing its job if the same rollup query (including all documents) against source index and rolled up one is returning different results. We have a case where we store some events and a few flags against each of them. There aren't many of them but they aren't 'fixed' and we can't (and don't want to) have a field for each of them. Adding this field to rollup breaks rolled up results for us. Unfortunately, for this single reason we are likely to drop using this feature and try to aggregate data for long term storage ourselves. |
With the 8.7 release of Elasticsearch, we have made a new downsampling capability associated with the new time series datastreams functionality generally available (GA). This capability was in tech preview in ILM since 8.5. Downsampling provides a method to reduce the footprint of your time series data by storing it at reduced granularity. The downsampling process rolls up documents within a fixed time interval into a single summary document. Each summary document includes statistical representations of the original data: the min, max, sum, value_count, and average for each metric. Data stream time series dimensions are stored unchanged. Downsampling is superior to rollup because:
Because of the introduction of this new capability, we are deprecating the rollups functionality, which never left the Tech Preview/Experimental status, in favor of downsampling and thus we are closing this issue. We encourage you to migrate your solution to downsampling and take advantage of the new TSDB functionality. |
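As a rough illustration of the summary documents described above, the following Python sketch buckets data points into fixed time intervals and keeps min, max, sum, and value_count per bucket, deriving the average from sum and value_count. This is a simplified model under stated assumptions, not Elasticsearch's implementation: the `downsample` function, the epoch-second timestamps, and the sample points are all hypothetical.

```python
from collections import defaultdict


def downsample(points, interval_s):
    """Collapse (timestamp_seconds, value) pairs into one summary per interval."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its fixed interval.
        buckets[ts - ts % interval_s].append(value)
    return {
        start: {
            "min": min(values),
            "max": max(values),
            "sum": sum(values),
            "value_count": len(values),
        }
        for start, values in sorted(buckets.items())
    }


points = [(0, 2.0), (30, 4.0), (61, 10.0)]
summary = downsample(points, 60)

# The average is not stored separately here; it is derived on read.
first = summary[0]
avg_first = first["sum"] / first["value_count"]
print(summary, avg_first)
```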
Elasticsearch version (bin/elasticsearch --version): v7.2.0 (Elastic Cloud)
Plugins installed: []
JVM version (java -version): N/A (Elastic Cloud)
OS version (uname -a if on a Unix-like system): Elastic Cloud
Description of the problem including expected versus actual behavior:
Rollup jobs that list an array field as a term produce incorrect calculations. Each record is split into one record per array value, so its metrics are aggregated multiple times. For example, rolling up an index with a single record that has a "value" of 2 and a "tags" field of [ "one", "two", "three" ] puts 3 records into the rollup index, one with "tags": "one" and so on. An aggregation that sums "value" then shows a total of 6.
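The arithmetic can be reproduced outside Elasticsearch with a small Python sketch. It only models the behavior described here (one bucket per array value, each carrying the full metric) and is not the rollup implementation; the function name `rollup_by_term` is hypothetical.

```python
from collections import defaultdict


def rollup_by_term(docs, term_field, metric_field):
    """Sum metric_field into one bucket per individual value of term_field."""
    buckets = defaultdict(int)
    for doc in docs:
        values = doc[term_field]
        if not isinstance(values, list):
            values = [values]
        # A document with N array values lands in N buckets,
        # and each bucket receives the document's full metric value.
        for term in values:
            buckets[term] += doc[metric_field]
    return dict(buckets)


docs = [{"value": 2, "tags": ["one", "two", "three"]}]

true_sum = sum(d["value"] for d in docs)
buckets = rollup_by_term(docs, "tags", "value")
rolled_sum = sum(buckets.values())
print(true_sum, buckets, rolled_sum)
```

Summing across the per-term buckets counts the single document three times, which is where the total of 6 comes from.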
Steps to reproduce:
Create an index "array-test" with a mapping like:
{
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "tags": {
        "type": "keyword",
        "ignore_above": 1024
      },
      "value": {
        "type": "long"
      }
    }
  }
}
Insert a record like:
{
  "@timestamp": "2019-07-30T12:00:00",
  "value": 2,
  "tags": [ "one", "two", "three" ]
}
Create a rollup job with "value" as a metric and "tags" as a term:
{
  "index_pattern": "array-test",
  "rollup_index": "array-test-rollup",
  "cron": "* * * * * ?",
  "page_size": 1000,
  "groups": {
    "date_histogram": {
      "field": "@timestamp",
      "interval": "1m",
      "delay": "1m"
    },
    "terms": {
      "fields": ["tags"]
    }
  },
  "metrics": [
    {
      "field": "value",
      "metrics": ["min", "max", "sum", "avg"]
    }
  ]
}
Start the job and wait a minute or two for it to roll up the index.
Run an aggregation against the original index /array-test/_search and see the sum of 2:
{
  "size": 0,
  "aggs": {
    "valuesum": { "sum": { "field": "value" } }
  }
}
Run the same aggregation against the rollup index /array-test-rollup/_rollup_search and see the sum of 6:
{
  "size": 0,
  "aggs": {
    "valuesum": { "sum": { "field": "value" } }
  }
}