StatsD Input: Add option for threaded parsing of metrics #6919
Comments
I'm assuming you are listening on UDP? If so, would you be able to run a development build (#6921) with some extra internal metrics? Let me know your platform and I can link to the CI build. To enable the internal metrics you will need to add this plugin:

[[inputs.internal]]
  collect_memstats = false

After restarting, I'd like to look at these two queries:
We can't run a development build in our customer-facing environments, but we've been able to reproduce our issues locally. Do you have a Mac executable built? Looking at your PR, I only see Linux/Windows mentioned. Thanks! Edit: Found it. We'll respond back with what we find from your queries.
Alright, here's what we've got, with the disclaimer that this was run locally rather than taken from a deployed environment. However, the behavior we've reproduced is essentially identical (just with smaller values) to what we've seen in production. First query:
The second query returned zero results. Telegraf debug log:
telegraf.conf:
It looks like my code handled … The second query would also require the [[inputs.net]] plugin:

[[inputs.net]]
  interfaces = ["eth0"]
  ignore_protocol_stats = false

Cherry-picking a few items for reference:
The plugin should be able to parse 1s / parse_time_ns worth of packets per second, for these two lines:
The number of packets received is much lower here, though it is hard to say whether that holds for your production systems. I expect having multiple parse goroutines will probably help a small amount but not really solve the issue, since the problem isn't throughput, it's dealing with traffic bursts. I think the first thing I'd like to experiment with is providing some back pressure on the read.
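To make the back-pressure idea concrete, here is a minimal, self-contained sketch (not the actual Telegraf code; the port, channel size, and names such as `in` and `parseWorker` are illustrative) of a UDP reader that blocks on a full parse channel instead of dropping the datagram:

```go
package main

import (
	"log"
	"net"
)

func main() {
	// Listen for statsd datagrams on the conventional statsd port.
	conn, err := net.ListenPacket("udp", ":8125")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Bounded queue of raw datagrams awaiting parsing (size is arbitrary here).
	in := make(chan []byte, 10000)

	// Single parser goroutine, mirroring the current single-parser design.
	go parseWorker(in)

	buf := make([]byte, 64*1024)
	for {
		n, _, err := conn.ReadFrom(buf)
		if err != nil {
			log.Println("read error:", err)
			continue
		}
		msg := make([]byte, n)
		copy(msg, buf[:n])

		// Blocking send: when the channel is full the reader stalls instead of
		// dropping, which pushes overflow into the kernel's UDP receive buffer.
		in <- msg
	}
}

func parseWorker(in <-chan []byte) {
	for msg := range in {
		_ = msg // parse the statsd line(s) here
	}
}
```

One caveat with this approach: with UDP, once the kernel's receive buffer fills, packets are still dropped there, so back pressure mainly changes where the drops happen under sustained overload rather than eliminating them.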
Your math looks right to me, but I share your confusion as to why we're dropping metrics, because both our observations and the query show that it's happening. I disagree with your point about bursty traffic: we ran another test (generating load with Apache Bench at roughly 180 TPS) and saw the same results with consistent traffic.
As another experiment, I opened a PR with the threading changes (#6922) and ran that version of Telegraf, which did resolve our local dropping issues (or at least added >100 TPS to what we could handle before dropping any messages). So I think that may be our ideal path forward, unless you disagree? Thanks!
I think we can add more worker threads, though I don't want to expose an option for the number of workers; I'll comment about that on the PR. What setting for allowed_pending_messages are you using?
Yeah, the PR is probably the best place for that conversation to take place. We've tried values as high as 2,400,000 for allowed_pending_messages.
#6922 was merged.
Feature Request
Opening a feature request kicks off a discussion.
Proposal:
Add a configuration option (possibly max_parser_threads) to the StatsD input plugin to let users configure the number of goroutines spun up to parse inputs off of the input channel, increasing throughput.
Current behavior:
As per my reading, every incoming message adds to the s.in channel (which is of size allowed_pending_messages), and only the parser() function pulls messages off of that channel. The parser is run using a single goroutine here:

telegraf/plugins/inputs/statsd/statsd.go, line 395 (commit d7b3f1f)
Desired behavior:
Spin up n goroutines to do the parsing instead of 1, but still default to the current behavior.
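As a rough illustration of the desired behavior, here is a sketch (assuming the hypothetical max_parser_threads option from the proposal; the types and names below are simplified stand-ins, not the real plugin code) of starting n parser goroutines against the same channel, defaulting to 1 so the current behavior is preserved:

```go
package main

import (
	"fmt"
	"sync"
)

// Statsd is a simplified stand-in for the plugin struct; only the pieces
// relevant to this sketch are shown.
type Statsd struct {
	// MaxParserThreads is the hypothetical option from the proposal.
	// A value of 1 (the default below) matches the current behavior.
	MaxParserThreads int

	in chan string // stand-in for s.in, sized by allowed_pending_messages
	wg sync.WaitGroup
}

// parser drains messages off the shared channel; several copies can run
// concurrently because they only ever receive from the channel.
func (s *Statsd) parser(id int) {
	defer s.wg.Done()
	for msg := range s.in {
		// The real plugin would parse the statsd line and record metrics here.
		fmt.Printf("worker %d parsed: %s\n", id, msg)
	}
}

// Start launches the parser goroutines.
func (s *Statsd) Start() {
	n := s.MaxParserThreads
	if n < 1 {
		n = 1 // default: a single parser goroutine, as today
	}
	for i := 0; i < n; i++ {
		s.wg.Add(1)
		go s.parser(i)
	}
}

func main() {
	s := &Statsd{MaxParserThreads: 4, in: make(chan string, 10000)}
	s.Start()

	// Feed a few fake statsd lines, then close the channel so the workers exit.
	for i := 0; i < 10; i++ {
		s.in <- fmt.Sprintf("test.counter:%d|c", i)
	}
	close(s.in)
	s.wg.Wait()
}
```

One consideration with multiple workers is that messages are no longer parsed strictly in arrival order, which could matter for last-value gauge semantics.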
Use case:
My team is seeing consistent message dropping from our Telegraf instances using the statsd input plugin. We're seeing the log message:
[inputs.statsd] Error: statsd message queue full. We have dropped <num> messages so far. You may want to increase allowed_pending_messages in the config
Our issue, I believe, is that the parser isn't able to keep up with the incoming message volume. We'd like to improve its throughput.
If people have other suggestions for configuration tweaks we can make to improve throughput, we'd also love to hear those!