[Transform] Performance: Unexpected long runtime for date_histogram group_by when using 2 different time fields #59061

Closed
Steuf opened this issue Jul 6, 2020 · 2 comments · Fixed by #63315

Steuf commented Jul 6, 2020

Elasticsearch version : 7.8

Plugins installed: []

JVM version : official docker for release 7.8

OS version : official docker for release 7.8

Description of the problem including expected versus actual behavior:

As requested on the forum: https://discuss.elastic.co/t/continous-transforms-accumulating-delay-can-be-tweak-for-speed-up/239807/6

Continuous transforms don't use the last checkpoint date when calculating the date range if the date_histogram group_by and the transform's sync setting use different time fields.

Steps to reproduce:

In the source index, create two time fields:

  • @timestamp => the time the log entry was created by nginx
  • processed_at => the time the log entry was processed, added by the ingest pipeline (Logstash in my case)

Create a continuous transform like this:

{  
  "settings": {
    "max_page_search_size": 10000
  },
  "frequency": "60s",
  "id": "THE_ID",
  "source": {
    "index": [
      "source-*"
    ],
    "query": {
      "bool": {
        "filter": [
          {
             // filter of transform
          }
        ]
      }
    }
  },
  "dest": {
    "index": "destination-index"
  },
  "sync": {
    "time": {
      "field": "processed_at",
      "delay": "5m"
    }
  },
  "pivot": {
    "group_by": {
      "@timestamp": {
        "date_histogram": {
          "field": "@timestamp",
          "calendar_interval": "1h"
        }
      },
      // some other group by
    },
    "aggregations": {
      // some aggregation settings
    }
  }
}

On each checkpoint, the transform queries the whole index instead of only the range between the previous and the current checkpoint.

Provide logs (if relevant):

range\":{\"processed_at\":{\"from\":null,\"to\":1594027879901
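
To make the difference concrete, here is a minimal sketch (not the actual transform code) of the range filter seen in the log next to the one I would expect. The helper `sync_range_filter` and the previous checkpoint value are hypothetical; only the `processed_at` field and the upper bound come from the log line above.

```python
from typing import Optional


def sync_range_filter(field: str,
                      last_checkpoint_ms: Optional[int],
                      next_checkpoint_ms: int) -> dict:
    """Build an Elasticsearch range query bounded by two checkpoints.

    Hypothetical helper for illustration only; not part of Elasticsearch.
    """
    bounds = {"lt": next_checkpoint_ms, "format": "epoch_millis"}
    if last_checkpoint_ms is not None:
        # Lower bound: skip everything already covered by the last checkpoint.
        bounds["gte"] = last_checkpoint_ms
    return {"range": {field: bounds}}


# Observed: no lower bound, so every document up to the checkpoint is scanned.
observed = sync_range_filter("processed_at", None, 1594027879901)

# Expected: only documents processed since the previous checkpoint.
# The previous checkpoint is an assumed value (60s earlier, matching "frequency").
expected = sync_range_filter("processed_at", 1594027879901 - 60_000, 1594027879901)
```

With a lower bound in place, the work per checkpoint would scale with the amount of newly ingested data rather than with the size of the whole index.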

@Steuf Steuf added the >bug and needs:triage labels Jul 6, 2020
@hendrikmuhs hendrikmuhs added the :ml/Transform label Jul 6, 2020
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml/Transform)

@hendrikmuhs hendrikmuhs removed the needs:triage label Jul 6, 2020
@hendrikmuhs hendrikmuhs changed the title from "Wrong checkpoint range on continuous transforms" to "[Transform] Performance: Unexpected long runtime for date_histogram group_by when using 2 different time fields" Jul 8, 2020
@hendrikmuhs

Thanks for the issue,

I changed the title a bit to reflect that this is a performance issue.

The issue relates to #54254, which introduced an optimization for date_histogram. This optimization is only applied if the time fields for date_histogram and sync are the same. It wasn't possible to generalize #54254 to different time fields.

However, the refactoring in PR #58744 makes this possible. It is a step towards solving this problem, with a follow-up PR once #58744 has been merged.
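
As a rough illustration of what such a generalization could look like (a sketch under stated assumptions, not the implementation in any of the PRs mentioned): if the minimum and maximum `@timestamp` of the documents that changed since the last checkpoint are known, the date_histogram buckets to recompute can be limited to that span. The function `bucket_range_filter` and the sample values are hypothetical, and the rounding assumes UTC and a fixed-length interval such as `1h`.

```python
HOUR_MS = 3_600_000


def bucket_range_filter(field: str,
                        min_changed_ms: int,
                        max_changed_ms: int,
                        interval_ms: int = HOUR_MS) -> dict:
    """Range query covering only the histogram buckets touched by changed docs.

    Hypothetical helper: min/max are assumed to come from a min/max aggregation
    on `field`, filtered to documents whose sync field falls between the last
    and the current checkpoint.
    """
    # Round down to the start of the first affected bucket.
    lower = (min_changed_ms // interval_ms) * interval_ms
    # Round up to the end of the last affected bucket (exclusive bound).
    upper = (max_changed_ms // interval_ms + 1) * interval_ms
    return {"range": {field: {"gte": lower, "lt": upper, "format": "epoch_millis"}}}


# Changed documents span two hourly buckets, so only those two are recomputed.
narrowed = bucket_range_filter("@timestamp", 1594026123456, 1594029605000)
```

The sync field still decides which documents are new; the group_by field only decides which buckets need to be rewritten.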

hendrikmuhs pushed a commit that referenced this issue Oct 16, 2020
… ingest timestamps (#63315)

optimize continuous data histogram group_by for other time fields independent
of sync, this allows the usage of ingest timestamps in continuous mode

fixes #59061