StatsD Input: Add option for threaded parsing of metrics #6919
Comments
I'm assuming you are listening on UDP? If so, would you be able to run a development build (#6921) with some extra internal metrics? Let me know your platform and I can link to the CI build. To enable the internal metrics you will need to add this plugin:

[[inputs.internal]]
  collect_memstats = false

After restarting, I'd like to look at these two queries:
We can't run a development build in our customer-facing environments, but we've been able to reproduce our issues locally. Do you have a Mac executable built? Looking at your PR, I only see Linux/Windows mentioned. Thanks! Edit: Found it. We'll respond back with what we find from your queries.
Alright, here's what we've got, with the disclaimer that this was run locally rather than taken from a deployed environment. However, the behavior we've reproduced is essentially identical (just with smaller values) to what we've seen in production. First query:
The second query returned zero results. Telegraf debug log:
telegraf.conf:
It looks like my code handled … The second query would also require the [[inputs.net]] plugin:

[[inputs.net]]
  interfaces = ["eth0"]
  ignore_protocol_stats = false

Cherry-picking a few items for reference:
The plugin should be able to parse 1s / parse_time_ns worth of packets per second, for these two lines:
The number of packets received is much lower here, though it is hard to say whether that holds for your production systems. I expect having multiple parse goroutines will probably help a small amount but not really solve the issue, since the problem isn't throughput, it's dealing with traffic bursts. I think the first thing I'd like to experiment with is providing some back pressure on the read.
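To make the back-pressure idea concrete, here is a minimal, self-contained sketch (not the actual Telegraf code; the port, channel size, and names such as `in` and `parseWorker` are illustrative) of a UDP reader that blocks on a full parse channel instead of dropping the datagram:

```go
package main

import (
	"log"
	"net"
)

func main() {
	// Listen for statsd datagrams on the conventional statsd port.
	conn, err := net.ListenPacket("udp", ":8125")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Bounded queue of raw datagrams awaiting parsing (size is arbitrary here).
	in := make(chan []byte, 10000)

	// Single parser goroutine, mirroring the current single-parser design.
	go parseWorker(in)

	buf := make([]byte, 64*1024)
	for {
		n, _, err := conn.ReadFrom(buf)
		if err != nil {
			log.Println("read error:", err)
			continue
		}
		msg := make([]byte, n)
		copy(msg, buf[:n])

		// Blocking send: when the channel is full the reader stalls instead of
		// dropping, which pushes overflow into the kernel's UDP receive buffer.
		in <- msg
	}
}

func parseWorker(in <-chan []byte) {
	for msg := range in {
		_ = msg // parse the statsd line(s) here
	}
}
```

One caveat with this approach: with UDP, once the kernel's receive buffer fills, packets are still dropped there, so back pressure mainly changes where the drops happen under sustained overload rather than eliminating them.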
Your math looks right to me, but I share your confusion as to why we're dropping metrics, because both our observations and the query show that it's happening. I disagree with your point about bursty traffic: we ran another test (generating load with Apache Bench at roughly 180 TPS) and saw the same results with consistent traffic.
As another experiment, I opened a PR with the threading changes (#6922) and ran that version of Telegraf, which did resolve our local dropping issues (or at least added >100 TPS to what we could handle before dropping any messages). So I think that may be our ideal path forward, unless you disagree? Thanks!
I think we can add more worker threads, though I don't want to expose an option for the number of workers; I'll comment about that on the PR. What setting for allowed_pending_messages are you using?
Yeah, the PR is probably the best place for that conversation to take place. We've tried values as high as 2,400,000 for allowed_pending_messages.
#6922 was merged.
Feature Request
Opening a feature request kicks off a discussion.
Proposal:
Add a configuration option (possibly max_parser_threads) to the StatsD input plugin to let users configure the number of goroutines spun up to parse inputs off of the input channel, increasing throughput.
Current behavior:
As per my reading, every incoming message adds to the s.in channel (which is of size allowed_pending_messages), and only the parser() function pulls messages off of that channel. The parser is run using a single goroutine here:

telegraf/plugins/inputs/statsd/statsd.go, line 395 (commit d7b3f1f)
Desired behavior:
Spin up n goroutines to do the parsing instead of 1, but still default to the current behavior.
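As a rough illustration of the desired behavior, here is a sketch (assuming the hypothetical max_parser_threads option from the proposal; the types and names below are simplified stand-ins, not the real plugin code) of starting n parser goroutines against the same channel, defaulting to 1 so the current behavior is preserved:

```go
package main

import (
	"fmt"
	"sync"
)

// Statsd is a simplified stand-in for the plugin struct; only the pieces
// relevant to this sketch are shown.
type Statsd struct {
	// MaxParserThreads is the hypothetical option from the proposal.
	// A value of 1 (the default below) matches the current behavior.
	MaxParserThreads int

	in chan string // stand-in for s.in, sized by allowed_pending_messages
	wg sync.WaitGroup
}

// parser drains messages off the shared channel; several copies can run
// concurrently because they only ever receive from the channel.
func (s *Statsd) parser(id int) {
	defer s.wg.Done()
	for msg := range s.in {
		// The real plugin would parse the statsd line and record metrics here.
		fmt.Printf("worker %d parsed: %s\n", id, msg)
	}
}

// Start launches the parser goroutines.
func (s *Statsd) Start() {
	n := s.MaxParserThreads
	if n < 1 {
		n = 1 // default: a single parser goroutine, as today
	}
	for i := 0; i < n; i++ {
		s.wg.Add(1)
		go s.parser(i)
	}
}

func main() {
	s := &Statsd{MaxParserThreads: 4, in: make(chan string, 10000)}
	s.Start()

	// Feed a few fake statsd lines, then close the channel so the workers exit.
	for i := 0; i < 10; i++ {
		s.in <- fmt.Sprintf("test.counter:%d|c", i)
	}
	close(s.in)
	s.wg.Wait()
}
```

One consideration with multiple workers is that messages are no longer parsed strictly in arrival order, which could matter for last-value gauge semantics.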
Use case:
My team is seeing consistent message dropping from our Telegraf instances using the statsd input plugin. We're seeing the log message:
[inputs.statsd] Error: statsd message queue full. We have dropped <num> messages so far. You may want to increase allowed_pending_messages in the config
Our issue, I believe, is that the parser isn't able to keep up with the incoming message volume. We'd like to improve its throughput.
If people have other suggestions for configuration tweaks we can make to improve throughput, we'd also love to hear those!