fluent-bit crashes when reading log content that contains the 'NUL' character #9771

Open
FeiYing9 opened this issue Dec 27, 2024 · 1 comment

FeiYing9 commented Dec 27, 2024

Bug Report

Describe the bug

Fluent Bit was running as a pod in Kubernetes, where I use it to collect logs from several log files.
It crashed occasionally. After some debugging, I suspect the cause is a special character in the log content.

To Reproduce
The log generator is an ordinary script like this:

#!/bin/bash

show_help() {
    echo "Usage: $0 [NUM_LINES]"
    echo "If NUM_LINES is not provided, it will default to print 100 lines of logs."
    echo "NUM_LINES should be a positive integer representing the number of log lines to print."
    exit 1
}

if [ $# -gt 1 ]; then
    show_help
fi

if [ -z "$1" ]; then
    LOG_LINES=100
else
    if [[ "$1" =~ ^[0-9]+$ ]]; then
        LOG_LINES=$1
    else
        echo "Error: The parameter should be a positive integer."
        show_help
    fi
fi

NODE_IP=$(hostname | awk '{print $1}')
POD_NAME=${POD:-"unknown-pod"}


extra_info="This is extra info for log line. Some more details here heiheihheiheihheiheiheiheiheiheiheiheiheiheiheiheiheiheiheiheihei."

for ((i = 0; i < LOG_LINES; i++)); do
    TIMESTAMP=$(date +"%Y-%m-%d")
    echo "$TIMESTAMP | Node IP: $NODE_IP | Count: $i | $extra_info"
    sleep 2
done

Most of the log lines are normal.

The last log line read before fluent-bit crashed contains many 'NUL' characters where normal text should be ('This is extra info for log line. Some more d').

The original log file is too large to attach here.
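Since the original file cannot be attached, a small synthetic file may help others try to reproduce the problem. The following is only a sketch: `make_nul_log` is a hypothetical helper, the date and node name are sample values, and it is an assumption (not confirmed in this report) that a line shaped like this is enough to trigger the crash. It writes one normal line and one line where the start of the message is replaced by NUL bytes.

```shell
#!/bin/bash
# Hypothetical helper: write a two-line log file whose second line has NUL
# bytes where the beginning of the message text should be.
make_nul_log() {
    local out="$1"
    # Text that was replaced by NUL bytes in the reported log line.
    local lost="This is extra info for log line. Some more d"
    # Sample remainder of the message (shortened for the example).
    local rest="etails here."

    # Line 1: a normal line in the same shape as the generator script's output.
    printf '%s | Node IP: %s | Count: %d | %s%s\n' \
        "2024-12-27" "node-a" 0 "$lost" "$rest" > "$out"

    # Line 2: same prefix, then one NUL byte per lost character, then the rest.
    printf '%s | Node IP: %s | Count: %d | ' "2024-12-27" "node-a" 1 >> "$out"
    head -c "${#lost}" /dev/zero >> "$out"
    printf '%s\n' "$rest" >> "$out"
}
```

Calling `make_nul_log /tmp/slurm-test.out` and moving the result under a path matched by the tail input would be one way to test this, assuming the NUL run alone is the trigger.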

Screenshots
image

However, I cannot find any '0000' when I open the log file in vim and run :%!xxd

image
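Searching hex output by eye depends on how the dump groups bytes, so counting NUL bytes directly may be a more reliable check. The sketch below uses only `tr`, `wc`, `grep`, and `cut`; the function names `count_nul` and `nul_lines` are hypothetical, and `nul_lines` assumes the log contains no real SOH (0x01) bytes, because it maps NUL to SOH so that line-oriented tools can see it.

```shell
#!/bin/bash
# Print the number of NUL bytes in the file given as $1.
count_nul() {
    # Delete every byte except NUL, then count what is left.
    tr -cd '\000' < "$1" | wc -c | tr -d ' '
}

# Print the 1-based line numbers of lines that contain at least one NUL byte.
nul_lines() {
    # Map NUL to SOH (0x01) first; assumes the file has no genuine SOH bytes.
    tr '\000' '\001' < "$1" | grep -a -n $'\001' | cut -d: -f1
}
```

If `count_nul` prints 0 for the file on disk while fluent-bit still logged NUL bytes, that would suggest the bytes were only present at the moment the file was read, although this report does not confirm that.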

Your Environment

  • Version used: 3.1.5
  • Configuration:
apiVersion: fluentbit.fluent.io/v1alpha2
kind: ClusterFluentBitConfig
metadata:
  name: fluent-bit-slurm-config
  namespace: fluent
  labels:
    name: slurm-log
    component: studio
spec:
  service:
    #logLevel: trace         # off, error, warn, info, debug, and trace
    parsersFile: parsers.conf
    httpServer: true
    flushSeconds: 1         # Interval to flush output
    storage:
      backlogMemLimit: 5M  
      path: /fluent-bit/tail/slurm.storage.backlog
      sync: normal      
      maxChunksUp: 128    
  inputSelector:
    matchLabels:
      name: slurm-log
      component: studio
  parserSelector:
    matchLabels:
      name: slurm-log
      component: studio
  filterSelector:
    matchLabels:
      name: slurm-log
      component: studio
  outputSelector:
    matchLabels:
      name: slurm-log
      component: studio
---
apiVersion: fluentbit.fluent.io/v1alpha2
kind: FluentBit
metadata:
  name: fluent-slurm
  namespace: fluent
  labels:
    name: slurm-log
    component: studio
spec:
  image: repo-addr/fluent/fluent-operator/fluent-bit:3.1.5
  ports: []
  resources:
    requests:
      cpu: 500m
      memory: 250Mi
    limits:
      cpu: "2"
      memory: 4Gi
  fluentBitConfigName: fluent-bit-slurm-config
  nodeSelector:           
    kubernetes.io/hostname: 10.31.19.40              # run in one node
  tolerations:
    - operator: Exists
  volumesMounts:
    - mountPath: /slurm
      name: logs
    - mountPath: /etc/localtime
      name: localtime
  positionDB:
    hostPath:
      path: /path/fluent-bit/     # directory where the tail plugin stores its DB records; adjust to your environment
  volumes:
    - hostPath:
        path: /path/logs/hpc # host path of the logs; adjust to your environment
      name: logs
    - hostPath:
        path: /etc/localtime
      name: localtime
  • Environment name and version (e.g. Kubernetes? What version?): Kubernetes
  • Server type and version:
# kubectl  version
Client Version: v1.28.4-r0-28.0.5
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.6-r0-28.0.39.7
  • Operating System and version: Huawei Cloud EulerOS 2.0 (x86_64)
  • Filters and plugins:
---
apiVersion: fluentbit.fluent.io/v1alpha2
kind: ClusterInput
metadata:
  name: slurm
  namespace: fluent
  labels:
    name: slurm-log
    component: studio
spec:
  logLevel: debug  # off;error;warning;info;debug;trace
  tail:
    tag: "slurm.*"
    path: "/slurm/hpc-job-*/hpc-job-*/slurm-*"
    pathKey: "logPath"
    key: message
    bufferChunkSize: 32k
    bufferMaxSize: 128k
    multiline: false
    refreshIntervalSeconds: 1   # refreshing the list of watched files in seconds
    skipLongLines: true
    db: /fluent-bit/tail/slurm.db
    dbSync: Normal
    storageType: filesystem
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: slurm-fluent-bit-lua
  labels:
    name: slurm-log
    component: studio
  namespace: fluent
data:
  slurm-log.lua: |
    function extract_fields(tag, timestamp, record)
      local jobId, pod = string.match(tag, "slurm%.slurm%.([^%.]+)%.([^%.]+)%.*")
      if jobId and pod then
        record["job-id"] = jobId
        record["pod"] = pod
        return 2, timestamp, record
      end
      return -1, timestamp, record
    end
---
apiVersion: fluentbit.fluent.io/v1alpha2
kind: ClusterFilter
metadata:
  name: slurm
  namespace: fluent
  labels:
    name: slurm-log
    component: studio
spec:
  matchRegex: "slurm.*"
  filters:
    - lua:
        script:
          key: slurm-log.lua
          name: slurm-fluent-bit-lua
        call: extract_fields
        timeAsTable: true
---

Additional context

@FeiYing9
Author

I deleted the file's entry from the tail DB, removed the storage backlog, and renamed the log file so that it no longer matched the input path. After restarting fluent-bit, it ran fine.

I then renamed the log file back, and everything kept working.
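The recovery steps described above could be sketched roughly as follows. This is an illustration under assumptions, not the exact commands that were run: `reset_tail_state` and the `parked-` prefix are hypothetical, the `in_tail_files` table name is taken from the tail input's documented DB layout, the log path is assumed to contain no single quotes, and fluent-bit is assumed to be stopped while it runs. A prefix is used for the rename because the configured glob ends in `slurm-*`, so adding a suffix would still match.

```shell
#!/bin/bash
# Hypothetical helper mirroring the recovery steps. Run only while fluent-bit is stopped.
#   $1 = tail DB file            (e.g. the file behind /fluent-bit/tail/slurm.db)
#   $2 = storage backlog directory
#   $3 = log file path on this machine
#   $4 = file name as recorded in the DB (optional, defaults to $3)
reset_tail_state() {
    local db="$1" backlog="$2" log="$3"
    local name="${4:-$log}"

    # Step 1: forget the saved offset for this file.
    # Skipped when the sqlite3 CLI or the DB file is not available.
    if command -v sqlite3 >/dev/null 2>&1 && [ -f "$db" ]; then
        sqlite3 "$db" "DELETE FROM in_tail_files WHERE name = '$name';"
    fi

    # Step 2: remove buffered chunks from the storage backlog directory.
    if [ -d "$backlog" ]; then
        find "$backlog" -mindepth 1 -delete
    fi

    # Step 3: rename with a prefix so the 'slurm-*' glob no longer matches.
    if [ -f "$log" ]; then
        mv "$log" "$(dirname "$log")/parked-$(basename "$log")"
    fi
}
```

After fluent-bit has restarted cleanly, the file can be moved back to its original name, as was done here.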
