Socket listener only processes first 1 or 2 batches of metrics with errors #12176

Closed
7Hazard opened this issue Nov 4, 2022 · 6 comments
Labels
bug (unexpected problem or unintended behavior)

Comments


7Hazard commented Nov 4, 2022

Relevant telegraf.conf

[agent]
    debug = true
    logfile = ""
    interval = "10s"
    round_interval = true
    # metric_batch_size = 1000
    metric_buffer_limit = 50000
    # collection_jitter = "0s"
    # flush_interval = "10s"
    # flush_jitter = "0s"
    # precision = ""
    # hostname = ""
    # omit_hostname = false

[[outputs.influxdb_v2]]
    urls = ["http://localhost:8086"]
    token = "$TELEGRAF_INFLUX_TOKEN"
    organization = "devops"
    bucket = "devops"

[[outputs.file]]
    data_format = "influx"
    files = ["stdout"]

[[inputs.socket_listener]]
    service_address = "udp://:25826"

    ## Data format to consume.
    ## Each data format has its own unique set of configuration options, read
    ## more about them here:
    ##   https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md
    data_format = "collectd"

    ## Authentication file for cryptographic security levels
    #   collectd_auth_file = "/etc/collectd/auth_file"
    ## One of none (default), sign, or encrypt
    #   collectd_security_level = "encrypt"
    collectd_security_level = "none"
    ## Path to TypesDB specifications
    collectd_typesdb = ["/usr/share/collectd/types.db"]

    ## Multi-value plugins can be handled two ways.
    ## "split" will parse and store the multi-value plugin data into separate measurements
    ## "join" will parse and store the multi-value plugin as a single multi-value measurement.
    ## "split" is the default behavior for backward compatibility with previous versions of influxdb.
    collectd_parse_multivalue = "split"

Logs from Telegraf

https://gist.github.com/7Hazard/c3e7b49b8d2981cb99a6b8a0cc9e3238

System info

Docker Image - telegraf:1.24-alpine

Docker

FROM telegraf:1.24-alpine

COPY telegraf.conf /etc/telegraf/telegraf.conf
COPY collectd-types.db /usr/share/collectd/types.db

Steps to reproduce

These collectd metrics are forwarded from GitHub Enterprise to Telegraf.
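
For context, this kind of forwarding boils down to collectd's network plugin sending its binary protocol to the Telegraf listener; a minimal sketch of the equivalent hand-written collectd configuration is shown below (the hostname is a placeholder, and GitHub Enterprise configures forwarding through its own appliance settings rather than a collectd.conf you edit directly):

# Illustrative collectd.conf fragment, not taken from GitHub Enterprise.
# It forwards metrics over collectd's binary network protocol to the
# Telegraf socket_listener configured above (udp://:25826).
LoadPlugin network
<Plugin network>
  Server "telegraf.example.com" "25826"
</Plugin>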

Expected behavior

Metrics should continue to be sent to InfluxDB, not just during the first batches of Telegraf's lifetime.

Actual behavior

The first batches of received metrics are processed (with some errors) and sent to InfluxDB successfully, but after that Telegraf no longer sends any metrics to InfluxDB.

Additional info

Might be related to #5858

7Hazard added the bug (unexpected problem or unintended behavior) label on Nov 4, 2022
7Hazard changed the title from "Socket listener only processes metrics with some errors in the beginning of the programs life" to "Socket listener only processes first 1 or 2 batches of metrics with errors" on Nov 4, 2022

powersj commented Nov 4, 2022

Hi,

Metrics should continue to be sent to InfluxDB, not just during the first batches of Telegraf's lifetime.

Based on the timestamps, you have provided 2-3 seconds' worth of logs. Looking at the "Wrote batch of" messages, about 9000 metrics were written in 9 separate batches.

While there are a number of debug messages from the serializer saying it was unable to serialize a field, those fields are skipped and processing of the other fields continues. You mentioned #5858; however, PR #5943 implemented those debug messages and ensured that processing of the other fields would continue.

Which metrics are no longer showing up? How long did you let this run? What is sending the data to the socket listener? How often is it sending or generating new metrics?

powersj added the waiting for response (waiting for response from contributor) label on Nov 4, 2022

7Hazard commented Nov 5, 2022

Metrics are being sent continuously from collectd at intervals of 10 seconds. After the initial burst of metrics is processed, the buffer stays empty, but metrics are certainly still being sent.
For Telegraf to process and send more metrics to InfluxDB, I have to restart it.
If I really wanted a hacky workaround, I could restart Telegraf every 10 seconds to have it pick up and forward the metrics to InfluxDB.

telegraf-tiger bot removed the waiting for response (waiting for response from contributor) label on Nov 5, 2022

7Hazard commented Nov 7, 2022

Here are some more logs, where InfluxDB was turned off at the beginning and then turned back on a couple of minutes later: https://gist.github.com/7Hazard/1ef922b590592b4029e59255e287088b
It appears to be the same behavior: the incoming buffer stays empty after the first burst.
It's unclear to me which metrics are failing to be serialized.


powersj commented Nov 7, 2022

InfluxDB was turned off

This should have no effect on how Telegraf processes metrics from the socket listener. All it would do is increase the number of metrics in the buffer and produce some error messages about the output not being available. You can see this in messages like these:

2022-11-07T08:19:18Z D! [outputs.influxdb_v2] Buffer fullness: 5988 / 50000 metrics
2022-11-07T08:19:18Z E! [agent] Error writing to outputs.influxdb_v2: failed to send metrics to any configured server(s)

It's unclear to me which metrics are failing to be serialized.

It says so in the logs: the statsd_value metric has a NaN as part of its value field.

2022-11-07T08:19:14Z D! [outputs.file] Could not serialize metric: "statsd_value,host=github-test.********.se,type=latency,type_instance=github/unicorn/browser/requests_per_second-upper": no serializable fields
2022-11-07T08:19:14Z D! [serializers.influx] could not serialize field "value": is NaN; discarding field
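
For context, InfluxDB line protocol has no representation for NaN or Inf, so the influx serializer drops such fields and keeps whatever else is serializable; if every field is dropped, the whole metric is skipped with the "no serializable fields" message. A rough sketch of that skip-and-continue behavior (illustrative only, not Telegraf's actual serializer code):

package main

import (
    "fmt"
    "math"
    "sort"
    "strings"
)

// serializeInflux is a toy stand-in for the influx serializer: it emits
// "measurement field=value,..." and silently drops NaN/Inf fields.
func serializeInflux(measurement string, fields map[string]float64) (string, error) {
    names := make([]string, 0, len(fields))
    for name := range fields {
        names = append(names, name)
    }
    sort.Strings(names) // deterministic output for the example

    parts := make([]string, 0, len(names))
    for _, name := range names {
        v := fields[name]
        if math.IsNaN(v) || math.IsInf(v, 0) {
            continue // line protocol cannot represent NaN or Inf
        }
        parts = append(parts, fmt.Sprintf("%s=%g", name, v))
    }
    if len(parts) == 0 {
        return "", fmt.Errorf("no serializable fields")
    }
    return measurement + " " + strings.Join(parts, ","), nil
}

func main() {
    // Only the NaN field is dropped; the rest of the metric survives.
    line, _ := serializeInflux("statsd_value", map[string]float64{"value": math.NaN(), "count": 3})
    fmt.Println(line) // statsd_value count=3

    // If every field is NaN, the whole metric is skipped.
    _, err := serializeInflux("statsd_value", map[string]float64{"value": math.NaN()})
    fmt.Println(err) // no serializable fields
}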

I asked which metrics are no longer showing up, and it is still not clear to me what you think is missing. It would be ideal if you could provide a way to reproduce what you believe is the issue.

powersj added the waiting for response (waiting for response from contributor) label on Nov 7, 2022

7Hazard commented Nov 9, 2022

This appears to have been a networking issue in the Kubernetes cluster I was working in: the health check for the UDP port was trying TCP instead, which prevented packets from being sent. I figured this out after deducing that it wasn't an issue on Telegraf's side.
Pardon for the issue!
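
For anyone who runs into something similar: Kubernetes probes only support exec, httpGet, tcpSocket, and grpc, so a TCP-style check against a UDP-only port can never succeed. A rough sketch of declaring the listener port explicitly as UDP (names are illustrative, and the probe itself is omitted since it depends on the cluster setup):

apiVersion: v1
kind: Pod
metadata:
  name: telegraf              # illustrative name
spec:
  containers:
    - name: telegraf
      image: telegraf:1.24-alpine
      ports:
        # Declare the socket_listener port explicitly as UDP so that
        # Services and health checks are not wired up for TCP.
        - containerPort: 25826
          protocol: UDP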

7Hazard closed this as completed on Nov 9, 2022
telegraf-tiger bot removed the waiting for response (waiting for response from contributor) label on Nov 9, 2022

powersj commented Nov 9, 2022

Thanks for following up!
