Add timeseries.instance to allow automatic downsampling on query #10293

Merged · 13 commits · Apr 24, 2019

Conversation

@exekias (Contributor) commented Jan 23, 2019

This PR introduces a new field that will be attached to events carrying time series: timeseries.instance. Its objective is to allow queries to downsample metrics from multiple time series while consuming them.

The problem

When querying for time series, bucket size is important, as you may have several occurrences of a metric in the same bucket. Aggregations need to take this into account.

Example

With these values for kubernetes.container.memory.usage.bytes:

| host | container | t | t+10s | t+20s | t+30s | t+40s | t+50s |
|--------|---------|----|-------|-------|-------|-------|-------|
| host01 | busybox | 50 | 56 | 60 | 50 | 54 | 52 |
| host01 | nginx | 40 | 50 | 46 | 44 | 42 | 44 |
| host02 | nginx | 10 | 12 | 14 | 16 | 16 | 14 |

A query like "sum of containers memory usage per host" goes:

| host | t | t+10s | t+20s | t+30s | t+40s | t+50s |
|--------|---------|---------|---------|---------|---------|---------|
| host01 | 50 + 40 | 56 + 50 | 60 + 46 | 50 + 44 | 54 + 42 | 52 + 44 |
| host02 | 10 | 12 | 14 | 16 | 16 | 14 |

⚠️ But if we change the bucket size, something goes wrong:

| host | t | t+20s | t+40s |
|--------|-------------------|-------------------|-------------------|
| host01 | 50 + 40 + 56 + 50 | 60 + 46 + 50 + 44 | 54 + 42 + 52 + 44 |
| host02 | 10 + 12 | 14 + 16 | 16 + 14 |

Containers are double counted, as there is more than one value per bucket for the same container. We need a way to overcome this.

You can test this query at demo.elastic.co; change the time range to > 4h to see how the data goes wrong (bucket size goes above ~10s).

Time series instance id

This PR introduces a timeseries.instance field for Metricbeat documents (we should discuss the need for this in other Beats). This field contains a hash of all the k-v dimensions in the document. Dimensions are the set of fields whose combination defines a time series. In this example, the dimensions for kubernetes.container.memory.usage.bytes are host and container. This means each container in the returned documents will have a distinct and consistent timeseries.instance: all values for the same container will share the same timeseries.instance.
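To make the idea concrete, here is a minimal sketch of the hashing step (illustrative only, not the PR's actual code; the helper name and the keyword-only heuristic are simplifications), using the hashstructure library that this PR relies on:

```go
package main

import (
	"fmt"

	"github.com/mitchellh/hashstructure"
)

// computeInstance derives a timeseries.instance value by hashing an event's
// dimension fields. As a simplification, every string (keyword) field is
// treated as a dimension; numeric metric values are ignored.
func computeInstance(fields map[string]interface{}) (uint64, error) {
	dimensions := map[string]string{}
	for key, value := range fields {
		if s, ok := value.(string); ok {
			dimensions[key] = s
		}
	}
	// hashstructure hashes maps deterministically, independent of iteration order.
	return hashstructure.Hash(dimensions, nil)
}

func main() {
	h, err := computeInstance(map[string]interface{}{
		"host":      "host01",
		"container": "nginx",
		"kubernetes.container.memory.usage.bytes": 40, // the metric, not a dimension
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("timeseries.instance: %d\n", h)
}
```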

We will then be able to use this information to downsample values in buckets and make sure we only have one per time series before the main aggregation is executed.

Following the previous example, the query would do an average of all values per time series & bucket:

| host | timeseries.instance | t | t+20s | t+40s |
|--------|---|-------------|-------------|-------------|
| host01 | x | avg(50, 56) | avg(60, 50) | avg(54, 52) |
| host01 | y | avg(40, 50) | avg(46, 44) | avg(42, 44) |
| host02 | z | avg(10, 12) | avg(14, 16) | avg(16, 14) |

and then execute the main aggregation (sum):

| host | t | t+20s | t+40s |
|--------|---------|---------|---------|
| host01 | 53 + 45 | 55 + 45 | 53 + 43 |
| host02 | 11 | 15 | 15 |
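To make the arithmetic concrete, here is a toy sketch (illustrative only; this is not how the actual query executes) of the two-step "downsample per series, then aggregate" for host01's first 20s bucket:

```go
package main

import "fmt"

func avg(values []float64) float64 {
	var sum float64
	for _, v := range values {
		sum += v
	}
	return sum / float64(len(values))
}

func main() {
	// Values landing in the t..t+20s bucket for host01,
	// keyed by timeseries.instance.
	bucket := map[string][]float64{
		"x": {50, 56}, // host01/busybox
		"y": {40, 50}, // host01/nginx
	}

	// First downsample each time series to one value per bucket,
	// then run the main aggregation (sum) across series.
	var total float64
	for _, values := range bucket {
		total += avg(values)
	}
	fmt.Println(total) // 98 (= 53 + 45), instead of the double-counted 196
}
```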

I'm currently treating any keyword field as a dimension. We may need to fine-tune this by enabling/disabling fields in fields.yml, but it seems like a good heuristic to start with.

This feature is flagged as experimental; to enable it, users need to put this in metricbeat.yml:

timeseries.enabled: true

@exekias added the labels enhancement, discuss, libbeat, Metricbeat, Team:Integrations on Jan 23, 2019
@exekias requested review from a team as code owners on January 23, 2019 15:20
@exekias added the labels in progress, needs tests, needs_docs on Jan 23, 2019
@roncohen (Contributor) commented Jan 23, 2019

thanks @exekias! This is important infrastructure for getting better operational metrics graphing behavior in place. For example, the WIP metrics explorer will be able to use this under the hood.

```diff
@@ -6,6 +6,7 @@
   fields:
     - name: id
       type: long
+      dimension: true
```
@exekias (author):

this is an example where flagging the field as a dimension was required. id should actually be a keyword here, but we may find other places.

Contributor:

+1 on making it possible to configure fields as dimensions.

@exekias force-pushed the timeseries-instance branch from c5a3f3f to 9d8093e on January 23, 2019 16:11
@jsoriano (Member) commented:
Great, happy to see that we are on the way to having downsampling 😃

> We will then be able to use this information to downsample values in buckets and make sure we only have one per time series before the main aggregation is executed.

What is the plan for this? Will there be specific queries, or will this info be used by visualizations?

> Following the previous example, the query would do an average of all values per time series & bucket:
>
> | host | timeseries.instance | t | t+20s | t+40s |
> |--------|---|-------------|-------------|-------------|
> | host01 | x | avg(50, 56) | avg(60, 50) | avg(54, 52) |
> | host01 | y | avg(40, 50) | avg(46, 44) | avg(42, 44) |
> | host02 | z | avg(10, 12) | avg(14, 16) | avg(16, 14) |

Should we make this aggregation configurable too?

> I'm currently treating any keyword field as a dimension. We may need to fine-tune this by enabling/disabling fields in fields.yml, but it seems like a good heuristic to start with.

This makes sense, I wonder if having too many dimensions can hit performance.

```diff
@@ -34,3 +34,8 @@
     type: alias
     path: agent.hostname
     migration: true
+
+- name: timeseries.instance
```
Contributor:

Definitely a field we should have in ECS.

@exekias, could you open an issue for this in the ECS repo? That is probably also a good place for naming discussions.

@exekias (author):

That sounds good to me; it would be great to have it documented there. I will open one as soon as we know we will move this forward 👍

Contributor:

Do we already have an issue for it? I could not find it.

@exekias (author):

See the TODO on this PR :)

Contributor:

Oops, didn't go through the description again 🤦‍♂️

@exekias (author) commented Jan 24, 2019

> What is the plan for this? Will there be specific queries, or will this info be used by visualizations?

My hope is that the metrics explorer can make use of this field and do it transparently to the user, so they don't really need to care about it.

> Should we make this aggregation configurable too?

Maybe? I think that discussion is better had in the UI issue once we start working on it: elastic/kibana#28027

>> I'm currently treating any keyword field as a dimension. We may need to fine-tune this by enabling/disabling fields in fields.yml, but it seems like a good heuristic to start with.
>
> This makes sense, I wonder if having too many dimensions can hit performance.

I think the number of dimensions won't hurt performance; it's the number of time series that may. That means that if you do a query that aggregates too many lines, performance may be affected, but I don't think we have a better way without making changes in Elasticsearch.

@exekias force-pushed the timeseries-instance branch from 785de6f to 7c8569c on January 24, 2019 17:35
@ruflin (Contributor) left a comment:

++ on the concept of the timeseries id.

I assume just setting type: keyword as a dimension will not work, as there are quite a few fields which are keyword but are not related to the dimension. For example, service.version is probably not needed for the dimension. I would start with manual configuration (see the sketch after the questions below).

Questions:

- I would think one of the basic dimensions is the event.dataset field.
- For Prometheus and others, I assume the labels are defining the dimension.
- These dimension configs should be identical to what we need for rollups, I assume?
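As a sketch of what that manual configuration could look like in fields.yml (hypothetical; only the dimension: true flag on the long field appears in this PR's diffs, the dimension: false form is an assumption):

```yaml
- name: service.version
  type: keyword
  dimension: false   # keyword, but excluded from the time series identity
- name: id
  type: long
  dimension: true    # not a keyword, but explicitly flagged as a dimension
```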

```go
		}
	}

	h, err := hashstructure.Hash(instanceFields, nil)
```
Contributor:

I wonder if instead of making it a hash we should use the full readable string as a keyword. It would mean concatenating all the field/value pairs together as a keyword. This would make the id at least half readable as to what it is, which could be useful for debugging.

@exekias (author):

I considered this option, like maybe just putting in the JSON for all KVs. The problem I see is that this string may be too long and pollute the document. WDYT?

Contributor:

When you mentioned JSON it somehow reminded me of elastic/elasticsearch#25312, but I don't think it would apply here, as there is no such thing as "group by all these fields".

I'm less worried about polluting the document than about its size and whether it would have some effect on the overall size. Probably not, as the same string would show up many times, and it potentially even compresses better than a unique hash value. The reason I'm not concerned about polluting the document is that I do not expect users to often look at the raw document; they mainly look at the metrics. For debugging purposes I definitely think it would be beneficial to have the full string: in case things don't work as expected, we would directly see what things are grouped by.


@polyfractal commented

Meant to comment on this earlier but got distracted.

I had some concerns about the hashes but @exekias spent a long time chatting with me and I think my concerns were addressed/not real (thanks @exekias!).

I think the only major point is that the nature of the hashes means you'll only be able to use them if you always want to downsample the set of tuples as an individual time series first, before any other action. What I mean is that if you have the tuples [a,b,c] and [a,b,x], they will always hash to different values, which means you'll never be able to combine the values together as a single "series" [a,b].†

From my conversation with @exekias I think that's ok, since in this context it won't make sense to combine the datapoints from [a,b] to form one big blob of data before "downsampling".

† Or rather, if you want these tuples, you'll have to ignore the hash and fall back to a full set of terms aggs to fully partition each level.

Scripting

The only potential suggestion I had is that this hashing behavior could be replicated with a script. E.g. instead of pre-hashing the values into a discrete field, the hashes could be generated at runtime with a script in the terms agg (by just concatenating the values into the script output). That'd give the same benefit as hashing upfront, but with a bit more flexibility (you can change the hash, you don't have a dedicated field, etc.).

The downside is script execution at runtime, although that should be relatively minor nowadays with painless.

@roncohen (Contributor) commented Feb 26, 2019

this came up recently with regard to metrics coming in from APM agents. The solution suggested here should also be applied in APM Server to add the timeseries.instance. (edit: removed ping as it's a bit too early to do this in APM Server)

> instead of pre-hashing the values into a discrete field, the hashes could be generated at runtime with a script in the terms agg (by just concatenating the values into the script output)

certainly an option, but you'd have to loop over the keys/values in alphabetical order and concatenate them with "=" for each key/value pair. You'd also have to filter out numeric values.
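A sketch of that concatenation in Go (illustrative; the function name is made up, and only string fields are kept, per the filtering described above):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// readableInstance builds the human-readable alternative to the hash:
// key=value pairs for the string (keyword) dimensions, joined in
// alphabetical key order so the result is deterministic. Numeric values
// are filtered out.
func readableInstance(fields map[string]interface{}) string {
	keys := make([]string, 0, len(fields))
	for key, value := range fields {
		if _, ok := value.(string); ok {
			keys = append(keys, key)
		}
	}
	sort.Strings(keys)

	pairs := make([]string, 0, len(keys))
	for _, key := range keys {
		pairs = append(pairs, key+"="+fields[key].(string))
	}
	return strings.Join(pairs, ",")
}

func main() {
	fmt.Println(readableInstance(map[string]interface{}{
		"host":      "host01",
		"container": "nginx",
		"memory":    42, // numeric, filtered out
	}))
	// Output: container=nginx,host=host01
}
```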

@roncohen (Contributor) commented Feb 27, 2019

@ruflin asked:

> I would think one of the basic dimensions is the event.dataset field.

Yes. We could make it a "dimension"

> For Prometheus and others, I assume the labels are defining the dimension.

Sounds good to me.

> These dimension configs should be identical to what we need for rollups, I assume?

I think the story around how rollups relate to this needs some more thought. For example, for counters, roll-ups will need to take the derivative per timeseries.instance before the normal roll-up operations.

Something else: this could probably be experimental and opt-in initially, and then enabled by default in a later version. There might be performance implications in generating the hash that we didn't consider.

@ruflin (Contributor) commented Mar 4, 2019

@roncohen Definitely like the idea to start this as opt-in so we can start playing around with it.

@exekias self-assigned this on Apr 8, 2019
@exekias force-pushed the timeseries-instance branch from 508b727 to 8238b29 on April 12, 2019 15:45
@exekias added the labels needs_docs, review and removed the labels in progress, needs tests on Apr 17, 2019
@exekias force-pushed the timeseries-instance branch from 119635a to 9f21251 on April 17, 2019 13:26
@exekias (author) commented Apr 17, 2019

@jsoriano @roncohen @ruflin this PR is ready for a second look. I went with making this experimental and disabled by default for now; we can change that at any moment.

@jsoriano (Member) left a comment:

LGTM, thanks for adding tests.
I have only some doubts about the support of objects with wildcards, but that can be left for a future change.

@ruflin (Contributor) left a comment:

Overall LGTM. Looking forward to having this in so we can start playing around with it.


[source,yaml]
----
metricbeat.max_start_delay: 10s
Contributor:

This looks off?

@exekias (author):

indeed, fixed it
