Socket listener only processes first 1 or 2 batches of metrics with errors #12176

Closed
7Hazard opened this issue Nov 4, 2022 · 6 comments
Labels
bug (unexpected problem or unintended behavior)

Comments


7Hazard commented Nov 4, 2022

Relevant telegraf.conf

[agent]
    debug = true
    logfile = ""
    interval = "10s"
    round_interval = true
    # metric_batch_size = 1000
    metric_buffer_limit = 50000
    # collection_jitter = "0s"
    # flush_interval = "10s"
    # flush_jitter = "0s"
    # precision = ""
    # hostname = ""
    # omit_hostname = false

[[outputs.influxdb_v2]]
    urls = ["http://localhost:8086"]
    token = "$TELEGRAF_INFLUX_TOKEN"
    organization = "devops"
    bucket = "devops"

[[outputs.file]]
    data_format = "influx"
    files = ["stdout"]

[[inputs.socket_listener]]
    service_address = "udp://:25826"

    ## Data format to consume.
    ## Each data format has its own unique set of configuration options, read
    ## more about them here:
    ##   https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md
    data_format = "collectd"

    ## Authentication file for cryptographic security levels
    #   collectd_auth_file = "/etc/collectd/auth_file"
    ## One of none (default), sign, or encrypt
    #   collectd_security_level = "encrypt"
    collectd_security_level = "none"
    ## Path to TypesDB specifications
    collectd_typesdb = ["/usr/share/collectd/types.db"]

    ## Multi-value plugins can be handled two ways.
    ## "split" will parse and store the multi-value plugin data into separate measurements
    ## "join" will parse and store the multi-value plugin as a single multi-value measurement.
    ## "split" is the default behavior for backward compatibility with previous versions of influxdb.
    collectd_parse_multivalue = "split"

Logs from Telegraf

https://gist.github.com/7Hazard/c3e7b49b8d2981cb99a6b8a0cc9e3238

System info

Docker Image - telegraf:1.24-alpine

Docker

FROM telegraf:1.24-alpine

COPY telegraf.conf /etc/telegraf/telegraf.conf
COPY collectd-types.db /usr/share/collectd/types.db

Steps to reproduce

These collectd metrics are forwarded from GitHub Enterprise to Telegraf.
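
For context, this kind of forwarding boils down to collectd's network plugin sending its binary protocol to the Telegraf listener; a minimal sketch of the equivalent hand-written collectd configuration is shown below (the hostname is a placeholder, and GitHub Enterprise configures forwarding through its own appliance settings rather than a collectd.conf you edit directly):

# Illustrative collectd.conf fragment, not taken from GitHub Enterprise.
# It forwards metrics over collectd's binary network protocol to the
# Telegraf socket_listener configured above (udp://:25826).
LoadPlugin network
<Plugin network>
  Server "telegraf.example.com" "25826"
</Plugin>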

Expected behavior

Metrics should continue to be sent to InfluxDB, not just during the first batches of Telegraf's lifetime.

Actual behavior

The first batches of received metrics are processed (with some errors) and sent to InfluxDB successfully, but after that Telegraf no longer sends any metrics to InfluxDB.

Additional info

Might be related to #5858

7Hazard added the bug (unexpected problem or unintended behavior) label on Nov 4, 2022
7Hazard changed the title from "Socket listener only processes metrics with some errors in the beginning of the programs life" to "Socket listener only processes first 1 or 2 batches of metrics with errors" on Nov 4, 2022

powersj commented Nov 4, 2022

Hi,

Metrics should continue to be sent to InfluxDB, not just during the first batches of Telegraf's lifetime.

Based on the timestamps, you have provided 2-3 seconds' worth of logs. Looking at the "Wrote batch of" messages, about 9000 metrics were written in 9 separate batches.

While there are a number of debug messages from the serializer saying it was unable to serialize a field, those fields are skipped and processing of the other fields continues. You mentioned #5858; however, PR #5943 implemented those debug messages and ensured that processing of the other fields would continue.

Which metrics are no longer showing up? How long did you let this run? What is sending the data to the socket listener? How often is it sending or generating new metrics?

powersj added the waiting for response (waiting for response from contributor) label on Nov 4, 2022

7Hazard commented Nov 5, 2022

Metrics are being sent continuously from collectd at intervals of 10 seconds. After the initial burst of metrics is processed, the buffer stays empty, but metrics are certainly still being sent.
For Telegraf to process and send more metrics to InfluxDB, I have to restart it.
If I really wanted a hacky workaround, I could restart Telegraf every 10 seconds to have it pick up and forward the metrics to InfluxDB.

telegraf-tiger bot removed the waiting for response (waiting for response from contributor) label on Nov 5, 2022

7Hazard commented Nov 7, 2022

Here are some more logs, where InfluxDB was turned off at the beginning and then turned back on a couple of minutes later: https://gist.github.com/7Hazard/1ef922b590592b4029e59255e287088b
It appears to be the same behavior: the incoming buffer stays empty after the first burst.
It's unclear to me which metrics are failing to be serialized.


powersj commented Nov 7, 2022

InfluxDB was turned off

This should have no effect on how Telegraf processes metrics from the socket listener. All it would do is increase the number of metrics in the buffer and produce some error messages about the output not being available. You can see this in messages like these:

2022-11-07T08:19:18Z D! [outputs.influxdb_v2] Buffer fullness: 5988 / 50000 metrics
2022-11-07T08:19:18Z E! [agent] Error writing to outputs.influxdb_v2: failed to send metrics to any configured server(s)

It's unclear to me which metrics are failing to be serialized.

It says so in the logs: the statsd_value metric has a NaN as part of its value field.

2022-11-07T08:19:14Z D! [outputs.file] Could not serialize metric: "statsd_value,host=github-test.********.se,type=latency,type_instance=github/unicorn/browser/requests_per_second-upper": no serializable fields
2022-11-07T08:19:14Z D! [serializers.influx] could not serialize field "value": is NaN; discarding field
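
For context, InfluxDB line protocol has no representation for NaN or Inf, so the influx serializer drops such fields and keeps whatever else is serializable; if every field is dropped, the whole metric is skipped with the "no serializable fields" message. A rough sketch of that skip-and-continue behavior (illustrative only, not Telegraf's actual serializer code):

package main

import (
    "fmt"
    "math"
    "sort"
    "strings"
)

// serializeInflux is a toy stand-in for the influx serializer: it emits
// "measurement field=value,..." and silently drops NaN/Inf fields.
func serializeInflux(measurement string, fields map[string]float64) (string, error) {
    names := make([]string, 0, len(fields))
    for name := range fields {
        names = append(names, name)
    }
    sort.Strings(names) // deterministic output for the example

    parts := make([]string, 0, len(names))
    for _, name := range names {
        v := fields[name]
        if math.IsNaN(v) || math.IsInf(v, 0) {
            continue // line protocol cannot represent NaN or Inf
        }
        parts = append(parts, fmt.Sprintf("%s=%g", name, v))
    }
    if len(parts) == 0 {
        return "", fmt.Errorf("no serializable fields")
    }
    return measurement + " " + strings.Join(parts, ","), nil
}

func main() {
    // Only the NaN field is dropped; the rest of the metric survives.
    line, _ := serializeInflux("statsd_value", map[string]float64{"value": math.NaN(), "count": 3})
    fmt.Println(line) // statsd_value count=3

    // If every field is NaN, the whole metric is skipped.
    _, err := serializeInflux("statsd_value", map[string]float64{"value": math.NaN()})
    fmt.Println(err) // no serializable fields
}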

I asked which metrics are no longer showing up, and it is still not clear to me what you think is missing. It would be ideal if you could provide a way to reproduce what you believe is the issue.

powersj added the waiting for response (waiting for response from contributor) label on Nov 7, 2022

7Hazard commented Nov 9, 2022

This appears to have been a networking issue in the Kubernetes cluster I was working in: the health check for the UDP port was trying TCP instead, which prevented packets from being sent. I figured this out after deducing that it wasn't an issue on Telegraf's side.
Pardon for the issue!
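
For anyone who runs into something similar: Kubernetes probes only support exec, httpGet, tcpSocket, and grpc, so a TCP-style check against a UDP-only port can never succeed. A rough sketch of declaring the listener port explicitly as UDP (names are illustrative, and the probe itself is omitted since it depends on the cluster setup):

apiVersion: v1
kind: Pod
metadata:
  name: telegraf              # illustrative name
spec:
  containers:
    - name: telegraf
      image: telegraf:1.24-alpine
      ports:
        # Declare the socket_listener port explicitly as UDP so that
        # Services and health checks are not wired up for TCP.
        - containerPort: 25826
          protocol: UDP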

7Hazard closed this as completed on Nov 9, 2022
telegraf-tiger bot removed the waiting for response (waiting for response from contributor) label on Nov 9, 2022

powersj commented Nov 9, 2022

Thanks for following up!
