
Rollup API with term on an array field give incorrect aggregation results #45015

Closed
kandelon opened this issue Jul 30, 2019 · 9 comments
Labels
:StorageEngine/Rollup Turn fine-grained time-based data into coarser-grained data Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

Comments

@kandelon

Elasticsearch version (bin/elasticsearch --version): v7.2.0 (Elastic Cloud)

Plugins installed: []

JVM version (java -version): N/A (Elastic Cloud)

OS version (uname -a if on a Unix-like system): Elastic Cloud

Description of the problem including expected versus actual behavior:
Rollup jobs that list an array field as a term produce incorrect calculations. Each record is split into one rollup document per array value, so its metrics are aggregated multiple times. For example, rolling up an index with a single record that has a "value" of 2 and a "tags" field of [ "one", "two", "three" ] will put 3 records into the rollup index ("tags": "one", etc.). An aggregation summing "value" will then report a total of 6.

Steps to reproduce:

Please include a minimal but complete recreation of the problem, including
(e.g.) index creation, mappings, settings, query, etc. The easier you make it for
us to reproduce, the more likely it is that somebody will take the time to look at it.

  1. Create an index "array-test" with a mapping like:
    {
      "mappings": {
        "properties": {
          "@timestamp": {
            "type": "date"
          },
          "tags": {
            "type": "keyword",
            "ignore_above": 1024
          },
          "value": {
            "type": "long"
          }
        }
      }
    }

  2. Insert a record like:
    {
      "@timestamp": "2019-07-30T12:00:00",
      "value": 2,
      "tags": [ "one", "two", "three" ]
    }

  3. Create a rollup job with "value" as a metric and "tags" as a term:
    {
      "index_pattern": "array-test",
      "rollup_index": "array-test-rollup",
      "cron": "* * * * * ?",
      "page_size": 1000,
      "groups": {
        "date_histogram": {
          "field": "@timestamp",
          "interval": "1m",
          "delay": "1m"
        },
        "terms": {
          "fields": ["tags"]
        }
      },
      "metrics": [
        {
          "field": "value",
          "metrics": ["min", "max", "sum", "avg"]
        }
      ]
    }

  4. Start the job and wait a minute or two for it to roll up the index.

  5. Run an aggregation against the original index /array-test/_search and see the sum of 2:
    {
      "size": 0,
      "aggs": {
        "valuesum": { "sum": { "field": "value" } }
      }
    }

  6. Run the same aggregation against the rollup index /array-test-rollup/_rollup_search and see the sum of 6:
    {
      "size": 0,
      "aggs": {
        "valuesum": { "sum": { "field": "value" } }
      }
    }
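The mismatch above can be reproduced without a cluster. The following is a minimal Python sketch (an illustration, not Elasticsearch code) of how splitting a multi-valued term field into one rollup document per value inflates the sum:

```python
# One source document, as inserted in step 2.
source_docs = [
    {"timestamp": "2019-07-30T12:00", "value": 2, "tags": ["one", "two", "three"]}
]

def rollup(docs):
    """Emit one rollup document per unique (timestamp, tag) group,
    mirroring how the composite grouping key splits array values."""
    rolled = []
    for doc in docs:
        for tag in doc["tags"]:
            rolled.append({
                "timestamp": doc["timestamp"],
                "tag": tag,
                "value.sum": doc["value"],  # full value repeated per tag
            })
    return rolled

rollup_docs = rollup(source_docs)
print(sum(d["value"] for d in source_docs))       # 2 (source index)
print(sum(d["value.sum"] for d in rollup_docs))   # 6 (rollup index)
```

Each of the 3 rollup documents carries the full value of 2, so any aggregation that spans the tag buckets triple-counts it.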

@jimczi jimczi added the :StorageEngine/Rollup Turn fine-grained time-based data into coarser-grained data label Jul 30, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo

@jimczi
Contributor

jimczi commented Jul 30, 2019

This is expected. In your example the grouping key is a composite of the @timestamp and the tags fields, so we create one rollup document for each unique group that records the min, max, etc. of the value field. We don't have a way to keep a keyword field in rollup documents if it is not part of the grouping key (groups), since the number of unique terms could be huge (if you have a lot of different values aggregated into the same rollup document). Can you describe your use case and what you're trying to achieve here? It's unclear to me whether you want to group on the tags field or just keep the unique values in each 1m aggregate.

@kandelon
Author

The use case is that I have a bunch of devices that I am tracking metrics for and need to analyze, graph, etc. There is a desire for an open "tags" field that lets users of the system categorize the devices flexibly by supplying a list of arbitrary strings. The graphs then let one optionally filter by a set of tags. If I leave the tags out of the rollup, I cannot filter by them. If I include the tags as an array, I get incorrect aggregation results. Since the tags are unstructured, using them as keys would mean constantly modifying the rollup job on the fly to add to a large, growing list of terms.

@amey55

amey55 commented Feb 21, 2020

+1, I am facing the same issue with sum as well as terms aggregations on array fields in a rollup. See this thread for details: https://discuss.elastic.co/t/rollup-terms-aggregation-giving-wrong-result-with-array-fields/220300

@simone-smith

We at the Guardian are also facing this issue. We've defined a rollup job with the configuration below, rolling up on 11 fields (truncated for brevity), all of which have a single value:

    {
      "config": {
        "id": "pageviews_historical_job",
        "index_pattern": "pageviews",
        "rollup_index": "pageviews_historical",
        "cron": "30 29 17 * * ?",
        "groups": {
          "date_histogram": {
            "interval": "10m",
            "field": "pageview.dt",
            "delay": "14d",
            "time_zone": "UTC"
          },
          "terms": {
            "fields": [
              "pageview.path",
              ...
            ]
          }
        },
        "metrics": [],
        "timeout": "20s",
        "page_size": 10000
      }
    }

The primary store size of the resulting rollup index for the above job is approximately 13GB per day's worth of rolled up data.

However, adding the "pageview.tags" field, which consists of an array of values, to the terms we roll up causes the primary store size of the rollup index to balloon: it's now in the region of 50GB per day's worth of rolled up data.

We need to store the tags as an array to enable our users to filter on tags, but since the set of tags for a given article is generally static (give or take a small number of additions/removals once the piece has been published), we don't want or need to roll up each tag individually in a separate document. Is there a way to roll up each distinct set of tags instead?

@jimczi
Contributor

jimczi commented Feb 27, 2020

I don't think it's an issue but rather a missing feature. From what I understand, none of the use cases described in this issue requires having the tags in the group key. This is more a metric that you want to aggregate on each bucket:

{
    "groups": {
        "date_histogram": {
            "field": "@timestamp",
            "interval": "1m",
            "delay": "1m"
        }
    },
    "metrics": [
        {
            "field": "tags",
        "metrics": ["terms_100"]
        }
    ]
}

The terms_N metric doesn't exist yet, but I think it could solve all the problems reported here.
The fact that we split multi-valued grouped fields is by design, so I don't think we can make progress on that end, but we can try to extend the rollup metrics to handle a simple terms aggregation with a limit on size. This would allow searching rollup data by tags and would remove the correlation with the grouping key. Keeping the document count correct would be challenging though, since we need to map each term to its count, but this could eventually be handled by the new metric field that we're adding in #49830.
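To make the proposal concrete, here is a hypothetical Python sketch of what a terms_N metric might compute per bucket: the top-N terms with their counts are stored as a metric of the bucket, so documents are never split and the other metrics stay correct. The function name and shape are illustrative only; no such metric exists in Elasticsearch.

```python
from collections import Counter

def terms_n_metric(docs, field, n):
    """Hypothetical terms_N metric: keep the top-n terms (with counts)
    seen in a bucket, instead of using the field as a grouping key."""
    counts = Counter()
    for doc in docs:
        values = doc.get(field, [])
        if not isinstance(values, list):
            values = [values]
        counts.update(values)
    return counts.most_common(n)

# One time bucket containing two source documents.
bucket = [
    {"value": 2, "tags": ["one", "two", "three"]},
    {"value": 5, "tags": ["one"]},
]
print(terms_n_metric(bucket, "tags", 100))  # [('one', 2), ('two', 1), ('three', 1)]
print(sum(d["value"] for d in bucket))      # 7 -- nothing is split, so sums stay correct
```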
@csoulios, do you think it makes sense to consider the result of a simple terms aggregation as a metric ?

@rjernst rjernst added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label May 4, 2020
@rtyley

rtyley commented Jul 8, 2020

As a workaround for this issue, we've experimented with writing out our array field (named tags) as tag0, tag1, tag2, ..., tag19, and rolling up on those 20 fields - so our rollup job definition now looks like this.

This does work, and is just about queryable by tag if we modify our search request, but it dramatically increases the size of the rollup index, especially because, even if a document only has 2 or 3 tags, all 20 rollup tag fields are present (mostly null) in the resulting document.
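The flattening step described above can be sketched in a few lines of Python (an illustration of the workaround's indexing-time transform, assuming a fixed cap of 20 tags; the field names mirror the comment above):

```python
MAX_TAGS = 20

def flatten_tags(doc):
    """Copy the 'tags' array into fixed fields tag0..tag19 so each can be
    listed as a rollup term. Unused slots stay None, which is what bloats
    the rollup index for documents with few tags."""
    out = {k: v for k, v in doc.items() if k != "tags"}
    tags = doc.get("tags", [])[:MAX_TAGS]
    for i in range(MAX_TAGS):
        out[f"tag{i}"] = tags[i] if i < len(tags) else None
    return out

flat = flatten_tags({"value": 2, "tags": ["one", "two"]})
print(flat["tag0"], flat["tag1"], flat["tag2"])  # one two None
```

Because the 20 single-valued fields are never split by the rollup indexer, the aggregation results stay correct, at the cost of index size and awkward tag queries.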

The terms_N metric doesn't exist but I think it could solve all the problems reported here.

A solution properly supported within Elasticsearch would be great for us! Has there been a development that would work for us since #49830 was merged?

@michael-budnik

michael-budnik commented Jan 25, 2021

I've spent two days trying to figure out why my aggregations against the original index and rolled up one are not matching.
I would expect arrays to be a pretty common case which would be supported easily by rollups.

It's not doing its job if the same rollup query (including all documents) against the source index and the rolled-up one returns different results.

We have a case where we store some events and a few flags against each of them. There aren't many of them but they aren't 'fixed' and we can't (and don't want to) have a field for each of them. Adding this field to rollup breaks rolled up results for us.

Unfortunately, for this single reason we are likely to drop using this feature and try to aggregate data for long term storage ourselves.

@wchaparro
Member

With the 8.7 release of Elasticsearch, we have made generally available (GA) a new downsampling capability associated with the new time series data streams functionality. This capability had been in tech preview in ILM since 8.5. Downsampling provides a method to reduce the footprint of your time series data by storing it at reduced granularity. The downsampling process rolls up documents within a fixed time interval into a single summary document. Each summary document includes statistical representations of the original data: the min, max, sum, value_count, and average for each metric. Data stream time series dimensions are stored unchanged.

Downsampling is superior to rollup because:

  • Downsampled indices are searched through the _search API
  • It is possible to query multiple downsampled indices together with raw data indices
  • The pre-aggregation is based on the metrics and time series definitions in the index mapping, so very little configuration is required (i.e. it is much easier to add new time series)
  • Downsampling is managed as an action in ILM
  • It is possible to downsample a downsampled index, and reduce granularity as the index ages
  • The performance of the pre-aggregation process is superior in downsampling, as it builds on the time_series index mode infrastructure

Because of the introduction of this new capability, we are deprecating the rollups functionality, which never left the Tech Preview/Experimental status, in favor of downsampling and thus we are closing this issue. We encourage you to migrate your solution to downsampling and take advantage of the new TSDB functionality.
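The per-interval summary statistics described above (min, max, sum, value_count, and average for each metric) can be sketched like so. This is a simplified Python illustration of the downsampling computation, not the actual TSDB implementation, and the field names are illustrative:

```python
from collections import defaultdict

def downsample(docs, interval_key, metric_field):
    """Group documents by a fixed time bucket and emit one summary
    document per bucket with min, max, sum, value_count, and avg."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[interval_key]].append(doc[metric_field])
    summaries = []
    for bucket, values in buckets.items():
        summaries.append({
            interval_key: bucket,
            "min": min(values),
            "max": max(values),
            "sum": sum(values),
            "value_count": len(values),
            "avg": sum(values) / len(values),
        })
    return summaries

docs = [{"bucket": "12:00", "v": 2}, {"bucket": "12:00", "v": 4}]
print(downsample(docs, "bucket", "v"))
# one summary doc: min=2, max=4, sum=6, value_count=2, avg=3.0
```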

@wchaparro closed this as not planned Jun 23, 2023

9 participants