This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

Configure a lower scrape interval for Project Contour Envoy service monitor #1194

Closed
niels-s opened this issue Nov 18, 2020 · 4 comments · Fixed by #1229
Labels: area/components (Items related to components), kind/enhancement (New feature or request)

Comments

@niels-s
Contributor

niels-s commented Nov 18, 2020

Current situation
At the moment, the Envoy ServiceMonitor doesn't configure a scrape interval, so it falls back to the Prometheus default scrape interval of 30 seconds.

Impact
Because the scrape interval is so long, we suspect it smooths out latency over that period, so we are unable to see any latency spikes. Our own application-specific monitoring shows p99 latency spikes between 50ms and 100ms, and occasionally up to 1s. However, the Envoy p99 latency metrics show no spikes at all.

Ideal future situation
Lower the scrape interval of the Envoy ServiceMonitor so that latency spikes become visible in our latency graphs.

Implementation options

  • add a hardcoded lower scrape interval (e.g. 1s, 5s or 10s) to the Envoy ServiceMonitor
  • or make the scrape interval configurable for the Project Contour component
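As a rough sketch of the second option, the Python snippet below builds a ServiceMonitor manifest as a plain dict and only sets `spec.endpoints[].interval` when a scrape interval is supplied; when it is omitted, Prometheus falls back to its default, as described above. The function name `envoy_service_monitor`, the object name, namespace, label selector and port name are hypothetical placeholders, not what Lokomotive actually renders, and the duration check is deliberately simplified.

```python
import json
import re

# Simplified check: a single number plus unit (e.g. "5s", "1m").
# Real Prometheus durations also allow compound forms such as "1m30s".
_DURATION = re.compile(r"^[0-9]+(ms|s|m|h)$")


def envoy_service_monitor(scrape_interval=None):
    """Build a ServiceMonitor manifest for Envoy as a plain dict.

    The name, namespace, labels and port name are hypothetical
    placeholders, not the values Lokomotive actually ships.
    """
    endpoint = {"port": "metrics", "path": "/stats/prometheus"}
    if scrape_interval is not None:
        if not _DURATION.match(scrape_interval):
            raise ValueError(f"invalid scrape interval: {scrape_interval!r}")
        endpoint["interval"] = scrape_interval
    # Without "interval", Prometheus uses its global default scrape interval.
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": "envoy", "namespace": "projectcontour"},
        "spec": {
            "selector": {"matchLabels": {"app": "envoy"}},
            "endpoints": [endpoint],
        },
    }


if __name__ == "__main__":
    print(json.dumps(envoy_service_monitor("10s"), indent=2))
```

Since JSON is valid YAML, the printed manifest can be passed to `kubectl apply -f` as-is.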

Additional information
Currently using Lokomotive v0.4.1; however, v0.5.0 contains no changes related to this request.

@invidian
Member

@niels-s if you change the scrape interval manually in the ServiceMonitor, do the latency spikes show up?

@invidian invidian added area/components Items related to components kind/enhancement New feature or request labels Nov 18, 2020
@niels-s
Contributor Author

niels-s commented Nov 19, 2020

That's a fair question 👍 I should have thought of that myself 😅 I've adjusted the interval manually for now; I'll give it some time to gather data with the new interval and report back.

@niels-s
Contributor Author

niels-s commented Nov 19, 2020

To test, I lowered the scrape interval from the default 30s to an extreme 1s. Below are screenshots of the Envoy data displayed with a 1m and a 2s range query.

This one displays the 1m range query; the spikes only go up to 4000ms.
[Screenshot: Envoy latency with a 1m range query (CleanShot 2020-11-19 at 11 31 41)]

In the example below, the smaller 2s range shows spikes of up to 10000ms, whereas the previous chart smoothed the same spikes to less than half the actual latency.
[Screenshot: Envoy latency with a 2s range query (CleanShot 2020-11-19 at 11 31 18)]
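The two range windows line up with the two scrape intervals. `rate()` needs at least two samples inside its range, and a window of twice the scrape interval guarantees that however it aligns with the scrapes, so a 30s interval pushes the window to at least 1m while a 1s interval allows 2s. The sketch below encodes that rule; the metric name `envoy_cluster_upstream_rq_time_bucket` and the helper names are assumptions for illustration.

```python
# Assumed metric name: Envoy exposes upstream request time as a histogram,
# but the exact series name in a given setup may differ.
METRIC = "envoy_cluster_upstream_rq_time_bucket"


def min_rate_range_s(scrape_interval_s: int) -> int:
    """Smallest range (seconds) that always holds two scrapes.

    rate() returns nothing with fewer than two samples in the range; a
    window of 2x the scrape interval contains two samples regardless of
    how it lines up with the scrape timestamps.
    """
    return 2 * scrape_interval_s


def p99_query(range_s: int) -> str:
    """Build a p99 latency PromQL query over the given range (hypothetical helper)."""
    return f"histogram_quantile(0.99, sum(rate({METRIC}[{range_s}s])) by (le))"


for interval in (30, 1):
    print(interval, p99_query(min_rate_range_s(interval)))
```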

A 1s scrape interval may be too aggressive, but the default of 30s smooths out spikes not only in latency but in all the metrics. So, in general, it would be great if we could lower the scrape interval.
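The smoothing can be illustrated with the simplest case: mean latency computed from a histogram's sum and count over a window (the `rate(sum)/rate(count)` pattern). The traffic below is made up, a 2s burst of very slow requests inside an otherwise steady minute. Averaged over 2s windows the burst is fully visible; averaged over one 60s window it nearly disappears. Quantiles such as p99 are affected less directly, so treat this as an intuition aid rather than a model of the charts above.

```python
def windowed_means(sums, counts, window):
    """Mean latency per non-overlapping window, like rate(sum)/rate(count)."""
    means = []
    for start in range(0, len(sums), window):
        s = sum(sums[start:start + window])
        c = sum(counts[start:start + window])
        means.append(s / c if c else float("nan"))
    return means


# Hypothetical traffic: 10 requests/s at 100 ms for 60 s, except a 2 s
# burst (seconds 30 and 31) where every request takes 10,000 ms.
counts = [10] * 60
latency_ms = [10_000 if 30 <= t < 32 else 100 for t in range(60)]
sums = [c * l for c, l in zip(counts, latency_ms)]

print(max(windowed_means(sums, counts, 2)))   # 10000.0
print(max(windowed_means(sums, counts, 60)))  # 430.0
```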

@invidian
Member

Thanks for checking it, @niels-s. I wonder if there is some other way to detect those latency spikes without lowering the scrape interval 🤔 Maybe tuning the Envoy histogram buckets?
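To see why bucket layout matters, the sketch below re-implements, in simplified form, the linear interpolation that PromQL's `histogram_quantile()` performs inside the bucket containing the requested rank. The observations and both bucket layouts are made up for illustration and are not Envoy's actual defaults. With the true p99 at 3000ms, the coarse buckets report 5500ms and the finer ones 3750ms, because the estimate can only be as precise as the bucket that contains it.

```python
import math


def histogram_quantile(q, buckets):
    """Simplified version of Prometheus' bucket interpolation.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound and ending with (math.inf, total).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the last finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


def cumulative(bounds, observations):
    """Turn raw observations into cumulative (le, count) buckets."""
    pairs = [(b, sum(1 for o in observations if o <= b)) for b in bounds]
    return pairs + [(math.inf, len(observations))]


# Hypothetical data: 980 requests at 50 ms, 20 at 3000 ms (true p99 = 3000 ms).
observations = [50] * 980 + [3000] * 20
coarse = [100, 1000, 10000]
fine = [100, 250, 500, 1000, 2500, 5000, 10000]

print(histogram_quantile(0.99, cumulative(coarse, observations)))  # 5500.0
print(histogram_quantile(0.99, cumulative(fine, observations)))    # 3750.0
```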

@surajssd surajssd added the proposed/next-sprint Issues proposed for next sprint label Nov 25, 2020
@iaguis iaguis removed the proposed/next-sprint Issues proposed for next sprint label Nov 25, 2020
@surajssd surajssd self-assigned this Dec 1, 2020