[Transform] Performance: Unexpected long runtime for date_histogram group_by when using 2 different time fields #59061

Steuf · 2020-07-06T11:34:01Z

Elasticsearch version : 7.8

Plugins installed: []

JVM version : official docker for release 7.8

OS version : official docker for release 7.8

Description of the problem including expected versus actual behavior:

As asked on forum: https://discuss.elastic.co/t/continous-transforms-accumulating-delay-can-be-tweak-for-speed-up/239807/6

Transforms do not use the last checkpoint date when calculating the date range if the time field used in the date_histogram aggregation differs from the time field used in the transform's sync setting.

Steps to reproduce:

On the source index, create two time fields:

@timestamp => The date the log entry was created by nginx
processed_at => The date the log was processed, added at ingest (Logstash in my case)

Create a continuous transform like this:

{  
  "settings": {
    "max_page_search_size": 10000
  },
  "frequency": "60s",
  "id": "THE_ID",
  "source": {
    "index": [
      "source-*"
    ],
    "query": {
      "bool": {
        "filter": [
          {
             // filter of transform
          }
        ]
      }
    }
  },
  "dest": {
    "index": "destination-index"
  },
  "sync": {
    "time": {
      "field": "processed_at",
      "delay": "5m"
    }
  },
  "pivot": {
    "group_by": {
      "@timestamp": {
        "date_histogram": {
          "field": "@timestamp",
          "calendar_interval": "1h"
        }
      },
      // some other group by
    },
    "aggregations": {
      // some aggregation settings
    }
  }
}

On each checkpoint, the transform queries the entire index instead of only the range between the previous and the current checkpoint.

Provide logs (if relevant):

range\":{\"processed_at\":{\"from\":null,\"to\":1594027879901
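To illustrate the difference between the logged filter and the one expected for an incremental checkpoint, here is a minimal Python sketch. The helper name `checkpoint_range_filter` and the previous-checkpoint value are hypothetical, chosen only for illustration; the upper bound is the one from the log above. It builds a plain Elasticsearch `range` query and is not the transform's internal code.

```python
from typing import Optional


def checkpoint_range_filter(sync_field: str, lower_ms: Optional[int], upper_ms: int) -> dict:
    """Build a range filter on the sync field for one checkpoint window.

    lower_ms=None reproduces the unbounded lower end seen in the log
    ("from": null), which makes the search cover the whole index.
    """
    bounds = {"lt": upper_ms, "format": "epoch_millis"}
    if lower_ms is not None:
        bounds["gte"] = lower_ms
    return {"range": {sync_field: bounds}}


# Observed: no lower bound, so every document up to the checkpoint matches.
observed = checkpoint_range_filter("processed_at", None, 1594027879901)

# Expected: bounded by the previous checkpoint (hypothetical value, 60s earlier).
expected = checkpoint_range_filter("processed_at", 1594027819901, 1594027879901)
```

With the bounded form, only documents processed since the previous checkpoint would have to be scanned.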


elasticmachine · 2020-07-06T11:49:22Z

Pinging @elastic/ml-core (:ml/Transform)

hendrikmuhs · 2020-07-08T08:28:56Z

Thanks for the issue,

I changed the title a bit to reflect that this is a performance issue.

The issue relates to #54254, which introduced an optimization for date_histogram. This optimization is only applied if the time fields for date_histogram and sync are the same. It wasn't possible to generalize #54254 to different time fields.

The refactoring in PR #58744, however, makes this possible. It is a step towards solving this problem, with a follow-up PR once #58744 has been merged.
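The thread does not describe how such an optimization would work when the two time fields differ, so the following Python sketch is only one plausible approach, not the actual implementation. It assumes the changed documents can be found through the sync field, and that the affected date_histogram buckets can then be derived from the minimum and maximum of the histogram field among those documents. The function names are hypothetical, and the 1h buckets are assumed to align to UTC hours.

```python
HOUR_MS = 3_600_000  # assumption: 1h buckets aligned to UTC hours


def changed_docs_bounds_request(sync_field, histogram_field,
                                last_checkpoint_ms, next_checkpoint_ms):
    """Search body asking for min/max of the histogram field over the
    documents that arrived (by sync field) in the checkpoint window."""
    return {
        "size": 0,
        "query": {
            "range": {
                sync_field: {
                    "gte": last_checkpoint_ms,
                    "lt": next_checkpoint_ms,
                    "format": "epoch_millis",
                }
            }
        },
        "aggs": {
            "min_ts": {"min": {"field": histogram_field}},
            "max_ts": {"max": {"field": histogram_field}},
        },
    }


def affected_bucket_filter(histogram_field, min_ts_ms, max_ts_ms,
                           interval_ms=HOUR_MS):
    """Range filter covering only the buckets touched by the changed docs:
    round the minimum down and the maximum up to bucket boundaries."""
    lower = (min_ts_ms // interval_ms) * interval_ms
    upper = (max_ts_ms // interval_ms + 1) * interval_ms
    return {
        "range": {
            histogram_field: {
                "gte": lower,
                "lt": upper,
                "format": "epoch_millis",
            }
        }
    }
```

Under these assumptions, the first request would be sent with `processed_at` as the sync field and `@timestamp` as the histogram field, and the resulting filter would restrict the pivot search to the hourly buckets that actually received new documents.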

… ingest timestamps (#63315): optimize continuous date_histogram group_by for other time fields independent of sync; this allows the usage of ingest timestamps in continuous mode. Fixes #59061

Steuf added the >bug and needs:triage labels Jul 6, 2020

hendrikmuhs added the :ml/Transform label Jul 6, 2020

hendrikmuhs removed the needs:triage label Jul 6, 2020

hendrikmuhs changed the title ~~Wrong checkpoint range on continuous transforms~~ [Transform] Performance: Unexpected long runtime for date_histogram group_by when using 2 different time fields Jul 8, 2020

hendrikmuhs mentioned this issue Oct 6, 2020

[Transform] improve continuous transform date_histogram group_by with ingest timestamps #63315

Merged

hendrikmuhs closed this as completed in #63315 Oct 16, 2020
