Add timeseries.instance to allow automatic downsampling on query #10293

Merged · 13 commits · Apr 24, 2019

Conversation

@exekias (Contributor) commented Jan 23, 2019

This PR introduces a new field that will be attached to events carrying time series: timeseries.instance. Its objective is to allow queries to downsample metrics from multiple time series while consuming them.

The problem

When querying for time series, bucket size is important, as you may have several occurrences of a metric in the same bucket. Aggregations need to take this into account.

Example

With these values for kubernetes.container.memory.usage.bytes:

| host | container | t | t+10s | t+20s | t+30s | t+40s | t+50s |
|--------|---------|----|-------|-------|-------|-------|-------|
| host01 | busybox | 50 | 56 | 60 | 50 | 54 | 52 |
| host01 | nginx | 40 | 50 | 46 | 44 | 42 | 44 |
| host02 | nginx | 10 | 12 | 14 | 16 | 16 | 14 |

A query like "sum of containers memory usage per host" goes:

| host | t | t+10s | t+20s | t+30s | t+40s | t+50s |
|--------|---------|---------|---------|---------|---------|---------|
| host01 | 50 + 40 | 56 + 50 | 60 + 46 | 50 + 44 | 54 + 42 | 52 + 44 |
| host02 | 10 | 12 | 14 | 16 | 16 | 14 |

⚠️ But if we change the bucket size, something goes wrong:

| host | t | t+20s | t+40s |
|--------|-------------------|-------------------|-------------------|
| host01 | 50 + 40 + 56 + 50 | 60 + 46 + 50 + 44 | 54 + 42 + 52 + 44 |
| host02 | 10 + 12 | 14 + 16 | 16 + 14 |

Containers are double counted, as there is more than one value per bucket for the same container. We need a way to overcome this.

You can test this query at demo.elastic.co; change the time range to > 4h to see how the data goes wrong (bucket size goes above ~10s).

Time series instance id

This PR introduces a timeseries.instance field for Metricbeat documents (we should discuss the need for this in other Beats). This field contains a hash of all the k-v dimensions in the document. Dimensions are the set of fields whose combination defines a time series. In this example, the dimensions for kubernetes.container.memory.usage.bytes are host and container. This means each container in the returned documents will have a distinct and consistent timeseries.instance: all values for the same container will share the same timeseries.instance.
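To make the idea concrete, here is a minimal sketch of the hashing step (illustrative only, not the PR's actual code; the helper name and the keyword-only heuristic are simplifications), using the hashstructure library that this PR relies on:

```go
package main

import (
	"fmt"

	"github.com/mitchellh/hashstructure"
)

// computeInstance derives a timeseries.instance value by hashing an event's
// dimension fields. As a simplification, every string (keyword) field is
// treated as a dimension; numeric metric values are ignored.
func computeInstance(fields map[string]interface{}) (uint64, error) {
	dimensions := map[string]string{}
	for key, value := range fields {
		if s, ok := value.(string); ok {
			dimensions[key] = s
		}
	}
	// hashstructure hashes maps deterministically, independent of iteration order.
	return hashstructure.Hash(dimensions, nil)
}

func main() {
	h, err := computeInstance(map[string]interface{}{
		"host":      "host01",
		"container": "nginx",
		"kubernetes.container.memory.usage.bytes": 40, // the metric, not a dimension
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("timeseries.instance: %d\n", h)
}
```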

We will then be able to use this information to downsample values in buckets and make sure we only have one per time series before the main aggregation is executed.

Following the previous example, the query would do an average of all values per time series & bucket:

| host | timeseries.instance | t | t+20s | t+40s |
|--------|---|-------------|-------------|-------------|
| host01 | x | avg(50, 56) | avg(60, 50) | avg(54, 52) |
| host01 | y | avg(40, 50) | avg(46, 44) | avg(42, 44) |
| host02 | z | avg(10, 12) | avg(14, 16) | avg(16, 14) |

and then execute the main aggregation (sum):

| host | t | t+20s | t+40s |
|--------|---------|---------|---------|
| host01 | 53 + 45 | 55 + 45 | 53 + 43 |
| host02 | 11 | 15 | 15 |
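To make the arithmetic concrete, here is a toy sketch (illustrative only; this is not how the actual query executes) of the two-step "downsample per series, then aggregate" for host01's first 20s bucket:

```go
package main

import "fmt"

func avg(values []float64) float64 {
	var sum float64
	for _, v := range values {
		sum += v
	}
	return sum / float64(len(values))
}

func main() {
	// Values landing in the t..t+20s bucket for host01,
	// keyed by timeseries.instance.
	bucket := map[string][]float64{
		"x": {50, 56}, // host01/busybox
		"y": {40, 50}, // host01/nginx
	}

	// First downsample each time series to one value per bucket,
	// then run the main aggregation (sum) across series.
	var total float64
	for _, values := range bucket {
		total += avg(values)
	}
	fmt.Println(total) // 98 (= 53 + 45), instead of the double-counted 196
}
```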

I'm currently treating any keyword field as a dimension. We may need to fine-tune this by enabling/disabling fields in fields.yml, but it seems like a good heuristic to start with.

This feature is flagged as experimental; to enable it, users need to put this in metricbeat.yml:

timeseries.enabled: true

@exekias added the labels enhancement, discuss, libbeat, Metricbeat, Team:Integrations on Jan 23, 2019
@exekias requested review from a team as code owners on January 23, 2019 15:20
@exekias added the labels in progress, needs tests, needs_docs on Jan 23, 2019
@roncohen (Contributor) commented Jan 23, 2019

thanks @exekias! This is important infrastructure for getting better operational metrics graphing behavior in place. For example, the WIP metrics explorer will be able to use this under the hood.

```diff
@@ -6,6 +6,7 @@
   fields:
     - name: id
       type: long
+      dimension: true
```
@exekias (author):

this is an example where flagging the field as a dimension was required. id should actually be a keyword here, but we may find other places.

Contributor:

+1 on making it possible to configure fields as dimensions.

@exekias force-pushed the timeseries-instance branch from c5a3f3f to 9d8093e on January 23, 2019 16:11
@jsoriano (Member) commented:
Great, happy to see that we are on the way to having downsampling 😃

> We will then be able to use this information to downsample values in buckets and make sure we only have one per time series before the main aggregation is executed.

What is the plan for this? Will there be specific queries, or will this info be used by visualizations?

> Following the previous example, the query would do an average of all values per time series & bucket:
>
> | host | timeseries.instance | t | t+20s | t+40s |
> |--------|---|-------------|-------------|-------------|
> | host01 | x | avg(50, 56) | avg(60, 50) | avg(54, 52) |
> | host01 | y | avg(40, 50) | avg(46, 44) | avg(42, 44) |
> | host02 | z | avg(10, 12) | avg(14, 16) | avg(16, 14) |

Should we make this aggregation configurable too?

> I'm currently treating any keyword field as a dimension. We may need to fine-tune this by enabling/disabling fields in fields.yml, but it seems like a good heuristic to start with.

This makes sense, I wonder if having too many dimensions can hit performance.

```diff
@@ -34,3 +34,8 @@
     type: alias
     path: agent.hostname
     migration: true
+
+- name: timeseries.instance
```
Contributor:

Definitely a field we should have in ECS.

@exekias, could you open an issue for this in the ECS repo? That is probably also a good place for naming discussions.

@exekias (author):

That sounds good to me; it would be great to have it documented there. I will open one as soon as we know we will move this forward 👍

Contributor:

Do we already have an issue for it? I could not find it.

@exekias (author):

See the TODO on this PR :)

Contributor:

Oops, didn't go through the description again 🤦‍♂️

@exekias (author) commented Jan 24, 2019

> What is the plan for this? Will there be specific queries, or will this info be used by visualizations?

My hope is that the metrics explorer can make use of this field and do it transparently to the user, so they don't really need to care about it.

> Should we make this aggregation configurable too?

Maybe? I think that discussion is better had in the UI issue once we start working on it: elastic/kibana#28027

>> I'm currently treating any keyword field as a dimension. We may need to fine-tune this by enabling/disabling fields in fields.yml, but it seems like a good heuristic to start with.
>
> This makes sense, I wonder if having too many dimensions can hit performance.

I think the number of dimensions won't hurt performance; it's the number of time series that may. That means that if you do a query that aggregates too many lines, performance may be affected, but I don't think we have a better way without making changes in Elasticsearch.

@exekias force-pushed the timeseries-instance branch from 785de6f to 7c8569c on January 24, 2019 17:35
@ruflin (Contributor) left a comment:

++ on the concept of the timeseries id.

I assume just setting type: keyword as a dimension will not work, as there are quite a few fields which are keyword but are not related to the dimension. For example, service.version is probably not needed for the dimension. I would start with manual configuration (see the sketch after the questions below).

Questions:

- I would think one of the basic dimensions is the event.dataset field.
- For Prometheus and others, I assume the labels are defining the dimension.
- These dimension configs should be identical to what we need for rollups, I assume?
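As a sketch of what that manual configuration could look like in fields.yml (hypothetical; only the dimension: true flag on the long field appears in this PR's diffs, the dimension: false form is an assumption):

```yaml
- name: service.version
  type: keyword
  dimension: false   # keyword, but excluded from the time series identity
- name: id
  type: long
  dimension: true    # not a keyword, but explicitly flagged as a dimension
```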

```go
		}
	}

	h, err := hashstructure.Hash(instanceFields, nil)
```
Contributor:

I wonder if instead of making it a hash we should use the full readable string as a keyword. It would mean concatenating all the field/value pairs together as a keyword. This would make the id at least half readable as to what it is, which could be useful for debugging.

@exekias (author):

I considered this option, like maybe just putting in the JSON for all KVs. The problem I see is that this string may be too long and pollute the document. WDYT?

Contributor:

When you mentioned JSON it somehow reminded me of elastic/elasticsearch#25312, but I don't think it would apply here, as there is no such thing as "group by all these fields".

I'm less worried about polluting the document than about its size and whether it would have some effect on the overall size. Probably not, as the same string would show up many times, and it potentially even compresses better than a unique hash value. The reason I'm not concerned about polluting the document is that I do not expect users to often look at the raw document; they mainly look at the metrics. For debugging purposes I definitely think it would be beneficial to have the full string: in case things don't work as expected, we would directly see what things are grouped by.


@polyfractal commented

Meant to comment on this earlier but got distracted.

I had some concerns about the hashes but @exekias spent a long time chatting with me and I think my concerns were addressed/not real (thanks @exekias!).

I think the only major point is that the nature of the hashes means you'll only be able to use them if you always want to downsample the set of tuples as an individual time series first, before any other action. What I mean is that if you have the tuples [a,b,c] and [a,b,x], they will always hash to different values, which means you'll never be able to combine the values together as a single "series" [a,b].†

From my conversation with @exekias I think that's ok, since in this context it won't make sense to combine the datapoints from [a,b] to form one big blob of data before "downsampling".

† Or rather, if you want these tuples, you'll have to ignore the hash and fall back to a full set of terms aggs to fully partition each level.

Scripting

The only potential suggestion I had is that this hashing behavior could be replicated with a script. E.g. instead of pre-hashing the values into a discrete field, the hashes could be generated at runtime with a script in the terms agg (by just concatenating the values into the script output). That'd give the same benefit as hashing upfront, but with a bit more flexibility (you can change the hash, you don't have a dedicated field, etc.).

The downside is script execution at runtime, although that should be relatively minor nowadays with painless.

@roncohen (Contributor) commented Feb 26, 2019

this came up recently with regard to metrics coming in from APM agents. The solution suggested here should also be applied in APM Server to add the timeseries.instance. (edit: removed ping as it's a bit too early to do this in APM Server)

> instead of pre-hashing the values into a discrete field, the hashes could be generated at runtime with a script in the terms agg (by just concatenating the values into the script output)

certainly an option, but you'd have to loop over the keys/values in alphabetical order and concatenate them with "=" for each key/value pair. You'd also have to filter out numeric values.
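A sketch of that concatenation in Go (illustrative; the function name is made up, and only string fields are kept, per the filtering described above):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// readableInstance builds the human-readable alternative to the hash:
// key=value pairs for the string (keyword) dimensions, joined in
// alphabetical key order so the result is deterministic. Numeric values
// are filtered out.
func readableInstance(fields map[string]interface{}) string {
	keys := make([]string, 0, len(fields))
	for key, value := range fields {
		if _, ok := value.(string); ok {
			keys = append(keys, key)
		}
	}
	sort.Strings(keys)

	pairs := make([]string, 0, len(keys))
	for _, key := range keys {
		pairs = append(pairs, key+"="+fields[key].(string))
	}
	return strings.Join(pairs, ",")
}

func main() {
	fmt.Println(readableInstance(map[string]interface{}{
		"host":      "host01",
		"container": "nginx",
		"memory":    42, // numeric, filtered out
	}))
	// Output: container=nginx,host=host01
}
```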

@roncohen (Contributor) commented Feb 27, 2019

@ruflin asked:

> I would think one of the basic dimensions is the event.dataset field.

Yes. We could make it a "dimension"

> For Prometheus and others, I assume the labels are defining the dimension.

Sounds good to me.

> These dimension configs should be identical to what we need for rollups, I assume?

I think the story around how rollups relate to this needs some more thought. For example, for counters, roll-ups will need to take the derivative per timeseries.instance before the normal roll-up operations.

Something else: this could probably be experimental and opt-in initially, and then enabled by default in a later version. There might be performance implications in generating the hash that we didn't consider.

@ruflin (Contributor) commented Mar 4, 2019

@roncohen Definitely like the idea to start this as opt-in so we can start playing around with it.

@exekias self-assigned this on Apr 8, 2019
@exekias force-pushed the timeseries-instance branch from 508b727 to 8238b29 on April 12, 2019 15:45
@exekias added the labels needs_docs, review and removed the labels in progress, needs tests on Apr 17, 2019
@exekias force-pushed the timeseries-instance branch from 119635a to 9f21251 on April 17, 2019 13:26
@exekias (author) commented Apr 17, 2019

@jsoriano @roncohen @ruflin this PR is ready for a second look. I went with making this experimental and disabled by default for now; we can change that at any moment.

@jsoriano (Member) left a comment:

LGTM, thanks for adding tests.
I have only some doubts about the support of objects with wildcards, but that can be left for a future change.

@ruflin (Contributor) left a comment:

Overall LGTM. Looking forward to having this in so we can start playing around with it.


[source,yaml]
----
metricbeat.max_start_delay: 10s
Contributor:

This looks off?

@exekias (author):

indeed, fixed it
