Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changing system time while Telegraf is running produces wrong round_interval timestamps #5335

Closed
Frakenz opened this issue Jan 23, 2019 · 5 comments · Fixed by #7390
Closed
Assignees
Labels
area/agent bug unexpected problem or unintended behavior
Milestone

Comments

@Frakenz
Copy link
Contributor

Frakenz commented Jan 23, 2019

Relevant telegraf.conf:

interval = "1m"
round_interval = true

flush_interval = "1m"

System info:

CentOS 7
Telegraf 1.9.3 (git: HEAD 6ad8c8b)
Telegraf 1.9.1 (git: HEAD 2063609) (on a second VM)
InfluxDB shell version: 1.7.3

Steps to reproduce:

  1. Make sure Telegraf is running and feeding data to InfluxDB.
  2. Check current system time with date
  3. Change the time to a time with a different second mark. For example sudo date +%T -s "17:55:32" (make sure current time is not close to being 32 to avoid not changing the second mark)
  4. Wait for your interval and flush interval to cycle once.
  5. Enter InfluxDB through the CLI and use the appropriate database.
  6. precision rfc3339
  7. Make a simple query like select pid from procstat where time > now() - 5m

Expected behavior:

2019-01-23T16:28:00Z 4944
2019-01-23T16:28:00Z 2796
2019-01-23T16:28:00Z 2115
2019-01-23T16:28:00Z 2169
2019-01-23T17:56:00Z 2115

Actual behavior:

2019-01-23T16:28:00Z 4944
2019-01-23T16:28:00Z 2796
2019-01-23T16:28:00Z 2115
2019-01-23T16:28:00Z 2169
2019-01-23T17:56:16Z 2115

Additional info:

Restarting Telegraf fixes this issue. I found this because I was running some VMs on a computer which was put on power saving mode during lunchtime, and the VMs clocks desynchronized. Using chronyc makestep produced this unexpected behavior.
This is probably the root cause of other similar issues, because if a system clock desynchronizes itself from official time and is later updated by chronyd, Telegraf will start recording metrics with the wrong timestamps.
One problem this causes is making a query which uses group by time between 2019-01-23T17:50:00Z and 2019-01-23T17:56:00Z with fill(0) can give you false nulls if there is other data with correct timestamps.

@danielnelson
Copy link
Contributor

Related to #4652 and #4538, the fix for all of these could be to move to a ticker that works with absolute times instead of only durations.

@danielnelson danielnelson added bug unexpected problem or unintended behavior area/agent labels Jan 23, 2019
@Frakenz
Copy link
Contributor Author

Frakenz commented Jan 24, 2019

@danielnelson as you suggested in #4538, setting precision = "60s" or precision = "1m" prevents some side effects of this issue (it sets all timestamps to HH:MM:00 which is what I need to group by time). I was just using the default precision before (not set).
If Telegraf rounds time marks correctly, setting the right precision might avoid this issue entirely.
How does time rounding work?

@danielnelson
Copy link
Contributor

The time is rounded to nearest multiple of the precision using Time.Round after the metric has been collected, with half way values rounded up.

@Frakenz
Copy link
Contributor Author

Frakenz commented Jan 24, 2019

Thank you, I can work with that for now. I still think this is a bug though because given smaller precisions time is more likely to drift and mess with timestamps after being updated by chronyd, so I'm not going to close this myself.
This is a great tool and I'm happy to see that you and your team reply so quickly

@gregvolk
Copy link
Contributor

I just wanted to chime in and say that I've hit this issue now too having rolled Telegraf 1.9.4 out to ~160 Win2016 servers. In my case it is definitely input plug-in dependent because I'm using inputs.mem to get system-wide memory used percent and that metric data never comes in with this timestamp problem but metrics from win_perf_counters do intermittently have this issue.

Examples:
Available_Bytes from win_perf_counters, notice the timestamp oscillating between nnnnnnnnn9 and nnnnnnnnn0:

telegraf.windows.server-1a178e7f.Memory.win_mem.Available_Bytes 4354543616 1558451819
telegraf.windows.server-1a178e7f.Memory.win_mem.Available_Bytes 4368003072 1558451880
telegraf.windows.server-1a178e7f.Memory.win_mem.Available_Bytes 4405002240 1558451939
telegraf.windows.server-1a178e7f.Memory.win_mem.Available_Bytes 4411219968 1558452000
telegraf.windows.server-1a178e7f.Memory.win_mem.Available_Bytes 4440842240 1558452059
telegraf.windows.server-1a178e7f.Memory.win_mem.Available_Bytes 4410294272 1558452120
telegraf.windows.server-1a178e7f.Memory.win_mem.Available_Bytes 4414660608 1558452179
telegraf.windows.server-1a178e7f.Memory.win_mem.Available_Bytes 4582981632 1558452240
telegraf.windows.server-1a178e7f.Memory.win_mem.Available_Bytes 4564197376 1558452299
telegraf.windows.server-1a178e7f.Memory.win_mem.Available_Bytes 4548448256 1558452360
telegraf.windows.server-1a178e7f.Memory.win_mem.Available_Bytes 4537679872 1558452419

The timestamps for Memory used_percent (from inputs.mem) always consistently end in 9 for the same intervals:

telegraf.windows.server-1a178e7f.Memory.used_percent 49.30262412470363 1558451819
telegraf.windows.server-1a178e7f.Memory.used_percent 49.14058342282609 1558451879
telegraf.windows.server-1a178e7f.Memory.used_percent 48.715405406952016 1558451939
telegraf.windows.server-1a178e7f.Memory.used_percent 48.64301641770831 1558451999
telegraf.windows.server-1a178e7f.Memory.used_percent 48.29814344655518 1558452059
telegraf.windows.server-1a178e7f.Memory.used_percent 48.65484281384299 1558452119
telegraf.windows.server-1a178e7f.Memory.used_percent 48.602959269510215 1558452179
telegraf.windows.server-1a178e7f.Memory.used_percent 46.64254336186293 1558452239
telegraf.windows.server-1a178e7f.Memory.used_percent 46.86199930949106 1558452299
telegraf.windows.server-1a178e7f.Memory.used_percent 47.04745436823201 1558452359
telegraf.windows.server-1a178e7f.Memory.used_percent 47.17072547310353 1558452419

I'm going to experiment with the precision config parm to see if that successfully works around this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/agent bug unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants