[Uptime] Snapshot count lags behind actual monitor states #58079

andrewvc · 2020-02-20T02:42:47Z

In Uptime 7.6, the overhauled snapshot count queries are faster, but the counts show too many monitors as down.

This patch improves the handling of timespans with snapshot counts. This feature originally worked, but suffered a regression when we increased the default timespan in the query context to 5m. This means that without this patch the counts you get are the maximum total number of monitors that were down over the past 5m, which is not really that useful.

Fixes #58079. This is an improved version of #58078.

Note: this is a bugfix targeting 7.6.1. I've decided to open this PR directly against 7.6 in the interest of time. We can forward-port this to 7.x / master later.

This patch improves the handling of timespans with snapshot counts. This feature originally worked, but suffered a regression when we increased the default timespan in the query context to 5m. This means that without this patch the counts you get are the maximum total number of monitors that were down over the past 5m, which is not really that useful.

We now use a scripted metric to always count precisely the number of up/down monitors. On my box this could process 400k summary docs in ~600ms. This should scale as shards are added. I attempted to keep memory usage relatively low by using simple maps of strings.
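To illustrate the difference between the two counting strategies described above, here is a minimal TypeScript sketch. It is not the scripted metric from the patch: the `SummaryDoc` shape, its field names, and both function names are assumptions made for illustration, and the real implementation runs inside Elasticsearch rather than in application code. The sketch keeps only the newest summary per monitor in a plain map and counts states from that, and contrasts it with a count of every monitor that was down at any point in the window.

```typescript
// Hypothetical, simplified summary document. Field names are illustrative
// and do not mirror the real Heartbeat schema.
interface SummaryDoc {
  monitorId: string;
  timestamp: number; // epoch milliseconds
  status: "up" | "down";
}

interface SnapshotCount {
  up: number;
  down: number;
  total: number;
}

// Count monitors by their most recent status within the window.
// Only the newest doc per monitor is retained, so memory grows with the
// number of monitors, not the number of docs.
function countLatestStates(docs: SummaryDoc[]): SnapshotCount {
  const latest = new Map<string, SummaryDoc>();
  for (const doc of docs) {
    const seen = latest.get(doc.monitorId);
    if (seen === undefined || doc.timestamp > seen.timestamp) {
      latest.set(doc.monitorId, doc);
    }
  }
  let up = 0;
  let down = 0;
  for (const doc of latest.values()) {
    if (doc.status === "down") {
      down += 1;
    } else {
      up += 1;
    }
  }
  return { up, down, total: up + down };
}

// For contrast: count every monitor that reported "down" at any point in
// the window. This over-reports monitors that have since recovered.
function countEverDown(docs: SummaryDoc[]): number {
  const everDown = new Set<string>();
  for (const doc of docs) {
    if (doc.status === "down") {
      everDown.add(doc.monitorId);
    }
  }
  return everDown.size;
}

// Monitor "a" was down at the start of the window and has recovered since.
const sampleDocs: SummaryDoc[] = [
  { monitorId: "a", timestamp: 0, status: "down" },
  { monitorId: "b", timestamp: 0, status: "up" },
  { monitorId: "a", timestamp: 60_000, status: "up" },
  { monitorId: "b", timestamp: 60_000, status: "up" },
];

const latestCounts = countLatestStates(sampleDocs); // { up: 2, down: 0, total: 2 }
const everDownCount = countEverDown(sampleDocs); // 1
```

With the sample data, the latest-state count reports no monitors down, while the whole-window count still reports one, which is the kind of lag this issue describes.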

elasticmachine · 2020-02-24T21:54:58Z

Pinging @elastic/uptime (Team:uptime)

…elastic#58389) (elastic#58415) Fixes elastic#58079. This is an improved version of elastic#58078. Co-authored-by: Elastic Machine <[email protected]>

This was referenced Feb 20, 2020

[Uptime] Improve snapshot timespan handling #58078

Closed

[Uptime] Use scripted metric for snapshot calculation #58247

Merged

andrewvc mentioned this issue Feb 24, 2020

[Uptime] Use scripted metric for snapshot calculation (#58247) #58389

Merged

tsullivan added the Team:Uptime - DEPRECATED Synthetics & RUM sub-team of Application Observability label Feb 24, 2020

andrewvc closed this as completed in #58389 Feb 24, 2020
