
PrometheusDuplicateTimestamps errors with log_to_metrics filter starting in fluent-bit 3.1.5 #9413

Open
reneeckstein opened this issue Sep 23, 2024 · 9 comments


@reneeckstein

Bug Report

Describe the bug
After upgrading from fluent-bit 3.1.4 to 3.1.5, all our k8s clusters started reporting PrometheusDuplicateTimestamps errors: the Prometheus expression rate(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0 keeps firing.
Prometheus is logging a lot of warnings like this:

ts=2024-09-23T16:22:32.820Z caller=scrape.go:1754 level=warn component="scrape manager" scrape_pool=serviceMonitor/platform-logging/fluent-bit/1 target=http://10.67.3.197:2021/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=32
ts=2024-09-23T16:22:38.237Z caller=scrape.go:1754 level=warn component="scrape manager" scrape_pool=serviceMonitor/platform-logging/fluent-bit/1 target=http://10.67.4.81:2021/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=1876
ts=2024-09-23T16:22:39.697Z caller=scrape.go:1754 level=warn component="scrape manager" scrape_pool=serviceMonitor/platform-logging/fluent-bit/1 target=http://10.67.13.208:2021/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=4
ts=2024-09-23T16:22:41.643Z caller=scrape.go:1754 level=warn component="scrape manager" scrape_pool=serviceMonitor/platform-logging/fluent-bit/1 target=http://10.67.3.110:2021/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=7
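
For context, the PrometheusDuplicateTimestamps alert is essentially a rule built around that expression; a minimal sketch of such a rule (the for duration and severity label here are assumptions, not our exact rule):

groups:
  - name: prometheus
    rules:
      - alert: PrometheusDuplicateTimestamps
        # same expression as above
        expr: rate(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0
        for: 10m
        labels:
          severity: warning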

To Reproduce

  • Steps to reproduce the problem:
    • Deploy fluent-bit 3.1.4 with the tail input config below as a DaemonSet in a k8s cluster and confirm that container logs and metrics look correct.
    • Update the fluent-bit image to 3.1.5 (or newer, up to and including 3.1.8) and check the /metrics endpoint on port 2021 for duplicated series (see the check sketched right after this list).
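
A quick way to check that endpoint for duplicated series (a sketch reusing the dedup pipeline posted further down in this thread; <pod-ip> is a placeholder):

curl -s http://<pod-ip>:2021/metrics | grep -v '^#' | rev | cut -d" " -f 2- | rev | sort | uniq -d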

Expected behavior
No duplicate metrics on the additional /metrics endpoint used by the log_to_metrics feature (usually on port 2021), no warnings in the Prometheus logs, and no PrometheusDuplicateTimestamps errors.


Your Environment

  • Version used: 3.1.5 (and higher); tested up to 3.1.8 and the issue is still present.
  • Configuration: (Helm chart values)
serviceMonitor:
  enabled: true
  interval: 10s
  scrapeTimeout: 10s
  additionalEndpoints:
  - port: log-metrics
    path: /metrics
    interval: 10s
    scrapeTimeout: 10s

extraPorts:
  - port: 2021
    containerPort: 2021
    protocol: TCP
    name: log-metrics

config:
  service: |
    [SERVICE]
        Flush 1
        Daemon Off
        Log_Level info
        Parsers_File parsers.conf
        Parsers_File custom_parsers.conf
        HTTP_Server On
        HTTP_Listen 0.0.0.0
        HTTP_Port {{ .Values.service.port }}

  inputs: |
    [INPUT]
        Name tail
        Tag kube.*
        Alias tail_container_logs
        Path /var/log/containers/*.log
        multiline.parser docker, cri
        DB /var/log/flb_kube.db
        DB.locking true
        Mem_Buf_Limit 32MB
        Skip_Long_Lines On

  filters: |
    [FILTER]
        Name kubernetes
        Alias kubernetes_all
        Match kube.*
        Merge_Log On
        Keep_Log Off
        K8S-Logging.Parser On
        K8S-Logging.Exclude On
        Annotations Off
        Buffer_Size 1MB
        Use_Kubelet true

    [FILTER]
        name               log_to_metrics
        match              kube.*
        tag                log_counter_metric
        metric_mode        counter
        metric_name        kubernetes_messages
        metric_description This metric counts Kubernetes messages
        kubernetes_mode    true

  outputs: |
    [OUTPUT]
        name               prometheus_exporter
        match              log_counter_metric
        host               0.0.0.0
        port               2021

  • Environment name and version (e.g. Kubernetes? What version?):
    • EKS; Kubernetes 1.30
  • Server type and version:
  • Operating System and version:
    • EKS on Bottlerocket OS 1.22.0 (aws-k8s-1.30) Kernel version 6.1.106 containerd://1.7.20+bottlerocket
  • Filters and plugins:
    • kubernetes, log_to_metrics

Additional context
It is just very annoying that every k8s cluster with this fairly common configuration reports PrometheusDuplicateTimestamps errors.

@edsiper
Member

edsiper commented Sep 26, 2024

@reneeckstein are you facing the same issue with v3.1.8? (we have some fixes in place for a similar problem)

@reneeckstein
Author

@edsiper Yes, we are facing the same issue in fluent-bit v3.1.8. I'm looking forward to v3.1.9; I noticed two metrics-related commits on the master branch.

@AmiSMB

AmiSMB commented Oct 2, 2024

This is happening on 3.0.3 in our K8s cluster. We have tried rolling out each version up to and including the latest 3.1.9 but are still getting the same issue. If we run the following script against the metrics path:

curl -s http://<fluentbit_host>:2020/api/v1/metrics/prometheus | grep -v '^#' | rev | cut -d" " -f 2- | rev | sort | uniq -c | sort -n | grep -vE '^[ ]+1[ ]'

we see the following:

Handling connection for 2020

  2 fluentbit_output_dropped_records_total{name="out_fluentd"} 0
  2 fluentbit_output_errors_total{name="out_fluentd"} 0
  2 fluentbit_output_retries_failed_total{name="out_fluentd"} 0

What is strange is that the fluent-bit Helm chart uses the path /api/v2/metrics/prometheus for the ServiceMonitor, which is even worse:

Handling connection for 2020

  2 fluentbit_output_chunk_available_capacity_percent{name="out_fluentd"}
  2 fluentbit_output_dropped_records_total{name="out_fluentd"}
  2 fluentbit_output_errors_total{name="out_fluentd"}
  2 fluentbit_output_proc_bytes_total{name="out_fluentd"}
  2 fluentbit_output_proc_records_total{name="out_fluentd"}
  2 fluentbit_output_retried_records_total{name="out_fluentd"}
  2 fluentbit_output_retries_failed_total{name="out_fluentd"}
  2 fluentbit_output_retries_total{name="out_fluentd"}
  2 fluentbit_output_upstream_busy_connections{name="out_fluentd"}
  2 fluentbit_output_upstream_total_connections{name="out_fluentd"}
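
For clarity, the check above does the following (same pipeline, just split up and commented; host and path are placeholders):

curl -s http://<fluentbit_host>:2020/api/v2/metrics/prometheus |
  grep -v '^#' |                   # drop the HELP/TYPE comment lines
  rev | cut -d" " -f 2- | rev |    # strip the trailing sample value from each line
  sort | uniq -c |                 # count how often each series (name + labels) appears
  sort -n |
  grep -vE '^[ ]+1[ ]'             # keep only series that appear more than once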

@drbugfinder-work
Contributor

@edsiper We are also seeing these kinds of Prometheus errors.
I don't know yet where they are coming from.

Enabling the flush_interval (#9251) seems to mitigate the issue, but I'm still trying to find the root cause. Was there maybe some kind of cmetrics change...?

@drbugfinder-work
Contributor

drbugfinder-work commented Oct 14, 2024

@edsiper @cosmo0920
Update from my end:
I’ve analyzed the issue and discovered a few things. It appears to be a race condition in this section of the code:

ic = input_chunk_get(in, event_type, tag, tag_len, buf_size, &set_down);

Consider the following configuration:

[SERVICE]
    flush              1
    log_level          info
[INPUT]
    Name               dummy
    Dummy              {"message":"dummy", "kubernetes":{"namespace_name": "default", "docker_id": "abc123", "pod_name": "pod1", "container_name": "mycontainer", "pod_id": "def456", "labels":{"app": "app1"}}, "duration": 20, "color": "red", "shape": "circle"}
    Samples            1
    Tag                dummy.log
    Alias              dummy1
[INPUT]
    Name               dummy
    Dummy              {"message":"hello", "kubernetes":{"namespace_name": "default", "docker_id": "abc123", "pod_name": "pod1", "container_name": "mycontainer", "pod_id": "def456", "labels":{"app": "app1"}}, "duration": 60, "color": "blue", "shape": "square"}
    Samples            1
    Tag                dummy.log2
    Alias              dummy2
[FILTER]
    name               log_to_metrics
    match              dummy.log*
    tag                test_metric
    metric_mode        counter
    metric_namespace   my_log_metric
    metric_subsystem   my_counter
    # flush_interval_sec       1
    metric_name        count_all_dummy_messages
    kubernetes_mode    on
    metric_description Dummy message count
[OUTPUT]
    name               prometheus_exporter
    match              *
    host               0.0.0.0
    port               2021
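
To try this locally, something like the following is enough (a sketch; assumes the configuration above is saved as repro.conf and a locally built fluent-bit binary is on the PATH):

fluent-bit -c repro.conf &
sleep 2
curl -s http://localhost:2021/metrics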

With a build later than 3.1.5, this results in the following metrics output:

~/test# curl http://localhost:2021/metrics
# HELP my_log_metric_my_crazy_counter_count_all_dummy_messages This metric counts dummy messages
# TYPE my_log_metric_my_crazy_counter_count_all_dummy_messages counter
my_log_metric_my_crazy_counter_count_all_dummy_messages{namespace_name="default",pod_name="pod1",container_name="mycontainer",docker_id="abc123",pod_id="def456"} 1
# HELP my_log_metric_my_crazy_counter_count_all_dummy_messages This metric counts dummy messages
# TYPE my_log_metric_my_crazy_counter_count_all_dummy_messages counter
my_log_metric_my_crazy_counter_count_all_dummy_messages{namespace_name="default",pod_name="pod1",container_name="mycontainer",docker_id="abc123",pod_id="def456"} 2

As shown, the metric is duplicated, with an increased (correct) value.

I found that for the first two metric appends, two chunks are created because the first one is busy/locked in this section:

if (ic->busy == FLB_TRUE || cio_chunk_is_locked(ic->chunk)) {
    ic = NULL;
}

I’m not sure why this happens.

If there’s more time between the metric appends, the issue does not occur. The same is true when the new interval option is enabled.

The doubled metrics seem to have started appearing with this commit: out_prometheus_exporter: Handle multiply concatenated metrics type of events

If I revert this commit, I encounter a different behavior:

With the configuration above, I now get:

~/test# curl http://localhost:2021/metrics
# HELP my_log_metric_my_crazy_counter_count_all_dummy_messages This metric counts dummy messages
# TYPE my_log_metric_my_crazy_counter_count_all_dummy_messages counter
my_log_metric_my_crazy_counter_count_all_dummy_messages{namespace_name="default",pod_name="pod1",container_name="mycontainer",docker_id="abc123",pod_id="def456"} 1

This output is incorrect, as it seems the second metric append from dummy2 is somehow lost.

However, if I change Samples to 2 for dummy2 in the configuration, the output changes to:

~/test# curl http://localhost:2021/metrics
# HELP my_log_metric_my_crazy_counter_count_all_dummy_messages This metric counts dummy messages
# TYPE my_log_metric_my_crazy_counter_count_all_dummy_messages counter
my_log_metric_my_crazy_counter_count_all_dummy_messages{namespace_name="default",pod_name="pod1",container_name="mycontainer",docker_id="abc123",pod_id="def456"} 1

# wait a second

~/test# curl http://localhost:2021/metrics
# HELP my_log_metric_my_crazy_counter_count_all_dummy_messages This metric counts dummy messages
# TYPE my_log_metric_my_crazy_counter_count_all_dummy_messages counter
my_log_metric_my_crazy_counter_count_all_dummy_messages{namespace_name="default",pod_name="pod1",container_name="mycontainer",docker_id="abc123",pod_id="def456"} 3

In this case, the metric updates normally within the log_to_metrics filter, and because the chunks are now created and not busy (for some reason), everything works as expected. The first update, however, is "skipped" (the expected values would be 2 and then 3, not 1 and 3).
I suspect this behavior was always present but went unnoticed until now. Nonetheless, I believe this is a general metric-append issue and not limited to the log_to_metrics filter specifically.

@edsiper, do you have any ideas how to proceed?

@reneeckstein
Author

@edsiper any update on this issue? We checked fluent-bit 3.1.10 and it still has the same issue.

@lmolas

lmolas commented Nov 15, 2024

I am also interested in updates, as I am facing the same issue.

@jplitza

jplitza commented Nov 18, 2024

In my setup at least, I was able to fix the problem by updating to Fluent Bit 3.2 and setting Flush_Interval_Sec 10. Lower values might work too.
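
Applied to the log_to_metrics filter from the original report, that setting looks roughly like this (a sketch; only the last two lines are new, using the flush interval option introduced in #9251):

[FILTER]
    name               log_to_metrics
    match              kube.*
    tag                log_counter_metric
    metric_mode        counter
    metric_name        kubernetes_messages
    metric_description This metric counts Kubernetes messages
    kubernetes_mode    true
    # batch metric updates and emit them once per interval
    flush_interval_sec 10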

@drbugfinder-work
Contributor

(quoting my earlier analysis from the Oct 14 comment above)

Ping @edsiper @cosmo0920, can you please take a look at this? To me this looks like a general issue; it currently only shows up in the log_to_metrics plugin because of its potentially high rate of metric updates. From my understanding, it can happen with any other metrics plugin as well, as long as the update rate is high enough.
From the plugin side there is not much we/I can do. As I proposed and @jplitza verified (thanks!), the new flush interval mitigates the issue because the update rate is then much lower.
