Changing system time while Telegraf is running produces wrong round_interval timestamps #5335

Frakenz · 2019-01-23T18:23:30Z

Relevant telegraf.conf:

interval = "1m"
round_interval = true

flush_interval = "1m"

System info:

CentOS 7
Telegraf 1.9.3 (git: HEAD 6ad8c8b)
Telegraf 1.9.1 (git: HEAD 2063609) (on a second VM)
InfluxDB shell version: 1.7.3

Steps to reproduce:

Make sure Telegraf is running and feeding data to InfluxDB.
Check current system time with date
Change the time to a time with a different second mark. For example sudo date +%T -s "17:55:32" (make sure current time is not close to being 32 to avoid not changing the second mark)
Wait for your interval and flush interval to cycle once.
Enter InfluxDB through the CLI and use the appropriate database.
precision rfc3339
Make a simple query like select pid from procstat where time > now() - 5m

Expected behavior:

2019-01-23T16:28:00Z 4944
2019-01-23T16:28:00Z 2796
2019-01-23T16:28:00Z 2115
2019-01-23T16:28:00Z 2169
2019-01-23T17:56:00Z 2115

Actual behavior:

2019-01-23T16:28:00Z 4944
2019-01-23T16:28:00Z 2796
2019-01-23T16:28:00Z 2115
2019-01-23T16:28:00Z 2169
2019-01-23T17:56:16Z 2115

Additional info:

Restarting Telegraf fixes this issue. I found this because I was running some VMs on a computer which was put on power saving mode during lunchtime, and the VMs clocks desynchronized. Using chronyc makestep produced this unexpected behavior.
This is probably the root cause of other similar issues, because if a system clock desynchronizes itself from official time and is later updated by chronyd, Telegraf will start recording metrics with the wrong timestamps.
One problem this causes is making a query which uses group by time between 2019-01-23T17:50:00Z and 2019-01-23T17:56:00Z with fill(0) can give you false nulls if there is other data with correct timestamps.

The text was updated successfully, but these errors were encountered:

danielnelson · 2019-01-23T18:50:16Z

Related to #4652 and #4538, the fix for all of these could be to move to a ticker that works with absolute times instead of only durations.

Frakenz · 2019-01-24T13:59:26Z

@danielnelson as you suggested in #4538, setting precision = "60s" or precision = "1m" prevents some side effects of this issue (it sets all timestamps to HH:MM:00 which is what I need to group by time). I was just using the default precision before (not set).
If Telegraf rounds time marks correctly, setting the right precision might avoid this issue entirely.
How does time rounding work?

danielnelson · 2019-01-24T19:10:27Z

The time is rounded to nearest multiple of the precision using Time.Round after the metric has been collected, with half way values rounded up.

Frakenz · 2019-01-24T19:29:24Z

Thank you, I can work with that for now. I still think this is a bug though because given smaller precisions time is more likely to drift and mess with timestamps after being updated by chronyd, so I'm not going to close this myself.
This is a great tool and I'm happy to see that you and your team reply so quickly

gregvolk · 2019-05-21T16:51:54Z

I just wanted to chime in and say that I've hit this issue now too having rolled Telegraf 1.9.4 out to ~160 Win2016 servers. In my case it is definitely input plug-in dependent because I'm using inputs.mem to get system-wide memory used percent and that metric data never comes in with this timestamp problem but metrics from win_perf_counters do intermittently have this issue.

Examples:
Available_Bytes from win_perf_counters, notice the timestamp oscillating between nnnnnnnnn9 and nnnnnnnnn0:

telegraf.windows.server-1a178e7f.Memory.win_mem.Available_Bytes 4354543616 1558451819
telegraf.windows.server-1a178e7f.Memory.win_mem.Available_Bytes 4368003072 1558451880
telegraf.windows.server-1a178e7f.Memory.win_mem.Available_Bytes 4405002240 1558451939
telegraf.windows.server-1a178e7f.Memory.win_mem.Available_Bytes 4411219968 1558452000
telegraf.windows.server-1a178e7f.Memory.win_mem.Available_Bytes 4440842240 1558452059
telegraf.windows.server-1a178e7f.Memory.win_mem.Available_Bytes 4410294272 1558452120
telegraf.windows.server-1a178e7f.Memory.win_mem.Available_Bytes 4414660608 1558452179
telegraf.windows.server-1a178e7f.Memory.win_mem.Available_Bytes 4582981632 1558452240
telegraf.windows.server-1a178e7f.Memory.win_mem.Available_Bytes 4564197376 1558452299
telegraf.windows.server-1a178e7f.Memory.win_mem.Available_Bytes 4548448256 1558452360
telegraf.windows.server-1a178e7f.Memory.win_mem.Available_Bytes 4537679872 1558452419

The timestamps for Memory used_percent (from inputs.mem) always consistently end in 9 for the same intervals:

telegraf.windows.server-1a178e7f.Memory.used_percent 49.30262412470363 1558451819
telegraf.windows.server-1a178e7f.Memory.used_percent 49.14058342282609 1558451879
telegraf.windows.server-1a178e7f.Memory.used_percent 48.715405406952016 1558451939
telegraf.windows.server-1a178e7f.Memory.used_percent 48.64301641770831 1558451999
telegraf.windows.server-1a178e7f.Memory.used_percent 48.29814344655518 1558452059
telegraf.windows.server-1a178e7f.Memory.used_percent 48.65484281384299 1558452119
telegraf.windows.server-1a178e7f.Memory.used_percent 48.602959269510215 1558452179
telegraf.windows.server-1a178e7f.Memory.used_percent 46.64254336186293 1558452239
telegraf.windows.server-1a178e7f.Memory.used_percent 46.86199930949106 1558452299
telegraf.windows.server-1a178e7f.Memory.used_percent 47.04745436823201 1558452359
telegraf.windows.server-1a178e7f.Memory.used_percent 47.17072547310353 1558452419

I'm going to experiment with the precision config parm to see if that successfully works around this issue.

danielnelson added bug unexpected problem or unintended behavior area/agent labels Jan 23, 2019

danielnelson self-assigned this Sep 24, 2019

danielnelson mentioned this issue Apr 23, 2020

Rework plugin tickers to prevent drift and spread write ticks #7390

Merged

3 tasks

danielnelson closed this as completed in #7390 May 6, 2020

danielnelson added this to the 1.15.0 milestone May 6, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Changing system time while Telegraf is running produces wrong round_interval timestamps #5335

Changing system time while Telegraf is running produces wrong round_interval timestamps #5335

Frakenz commented Jan 23, 2019 •

edited

Loading

danielnelson commented Jan 23, 2019

Frakenz commented Jan 24, 2019

danielnelson commented Jan 24, 2019

Frakenz commented Jan 24, 2019

gregvolk commented May 21, 2019

Changing system time while Telegraf is running produces wrong round_interval timestamps #5335

Changing system time while Telegraf is running produces wrong round_interval timestamps #5335

Comments

Frakenz commented Jan 23, 2019 • edited Loading

Relevant telegraf.conf:

System info:

Steps to reproduce:

Expected behavior:

Actual behavior:

Additional info:

danielnelson commented Jan 23, 2019

Frakenz commented Jan 24, 2019

danielnelson commented Jan 24, 2019

Frakenz commented Jan 24, 2019

gregvolk commented May 21, 2019

Frakenz commented Jan 23, 2019 •

edited

Loading