Vector seems to be hanging and not flushing its internal buffer #13718

Open
Sh4d1 opened this issue Jul 26, 2022 · 0 comments
Labels
type: bug A code related bug.

Comments

@Sh4d1
Contributor

Sh4d1 commented Jul 26, 2022

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Hello!

So I'm running Vector in the following setup:

I have 2 VMs behind a keepalived VIP, so only one node receives traffic (normally, see below).

These 2 nodes are running Vector with the following configuration:
[api]
enabled = true
address = "127.0.0.1:8686"

# HTTP server to receive logs from agents
[sources.vector_in]
type = "vector"
address = "0.0.0.0:8700"
version = "2"

# Parse logs received from all the agents
[transforms.vector_parser]
inputs = [ "vector_in", "journald_parser" ]
type   = "remap"
source = ''' # some json parsing
'''

# Read from journald
[sources.journald_in]
type = "journald"
since_now = true
[transforms.journald_filter]
type = "filter"
inputs = [ "journald_in" ]
condition = 'true && some comparisons'

# Parse journald logs
[transforms.journald_parser]
inputs = [ "journald_filter" ]
type   = "remap"
source = '''
.log.ident = del(.SYSLOG_IDENTIFIER)
.log.pid = to_string(del(._PID)) ?? ""
.log.pri = to_syslog_level(to_int(del(.PRIORITY)) ?? 0) ?? ""
.log.@timestamp = del(.timestamp)
.log.message = del(.message)
.log.host = del(.host)

. = .log
'''

[transforms.haproxy_filter]
type = "filter"
inputs = [ "vector_parser" ]
condition = ".ident == \"haproxy\""
[sinks.s3_out]
type = "aws_s3"
inputs = [ "haproxy_filter" ]
[...]
batch.max_bytes = 50_000_000
encoding.codec = "ndjson"
[sinks.s3_out.proxy]
enabled = true
[...]
[sinks.http_out]
type = "http"
inputs = [ "haproxy_filter" ]
[...]
compression = "none"
encoding.codec = "json"

[sinks.elasticsearch_out]
type = "elasticsearch"
inputs = [ "vector_parser" ]
[...]
suppress_type_name = true
mode = "bulk"

# Scrape internal metrics
[sources.metrics_in]
type = "internal_metrics"
scrape_interval_secs = 20
# Expose prometheus metrics
[sinks.metrics_out]
type = "prometheus"
inputs = [ "metrics_in" ]
[...]
default_namespace = "service"

Then there is the whole fleet of servers, also running Vector and sending their logs to the 2 Vector instances behind keepalived.

Here is the relevant and common part:
# Send all logs to vector server
[sinks.vector_out]
type = "vector"
inputs = [ "journald_parser", "syslog_parser"]
address = "VIP:8700"
version = "2"
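Note that vector_out has no explicit buffer section here, so the agents should be on Vector's defaults. Spelled out, my understanding of those defaults is roughly the following (this is an assumption, not something we set explicitly):

[sinks.vector_out.buffer]
type = "memory"        # in-memory buffer (assumed default)
max_events = 500       # assumed default capacity
when_full = "block"    # back-pressure the upstream instead of dropping

So I'd expect the agents to apply back-pressure rather than drop events when the receiving node stalls.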

The setup had been running with Vector 0.21.1 for about a month and a half, and I had pretty flat RAM usage at 1.3-1.4 GB for between 10k and 100k logs/s received by one of the Vector VMs.

A few days ago, in one of these setups, one receiving Vector started to get OOM-killed in a loop (8 GB max). I tried restarting Vector, moving the VIP to the other node, and upgrading to 0.23.0, but still had the same issue.

Obviously it looked dependent on the incoming log load. Since it was the weekend, I didn't really have time to dig deeper. Fast forward to today, when the pattern happened again. I had upgraded to 12 GB of RAM, but the problem persisted. I also added a healthcheck to keepalived that curls the health API with a 2 (and now 10) second timeout.

When RAM usage is high, Vector usually takes more than 10s to answer the healthcheck, triggering a VIP switch. This currently happens every 2 to 10 minutes, possibly leading to ARP cache issues and thus distributing the load between the 2 VMs.

The only thing I saw (see the video below) was that the RAM usage is correlated with the difference between the Events In and Events Out of the vector_in source. I'm not sure how to tackle this to get more information, though. Also (as seen in the video), it looks like sometimes Vector is not doing anything at all (i.e. no events are flowing).
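For reference, my understanding is that each sink keeps its own buffer, and as far as the parts pasted above show, none of the sinks on the receiving nodes declare one, so they should be on the in-memory default as well. If it helps with debugging, here is a sketch of what I could try in order to bound memory on the biggest sink (values are illustrative, not what is currently deployed):

[sinks.elasticsearch_out.buffer]
type = "disk"              # spill to disk instead of RAM (illustrative)
max_size = 1073741824      # ~1 GiB on disk, value picked arbitrarily
when_full = "block"        # back-pressure vector_in when the buffer is full

If the growing RAM really is the gap between vector_in and the sinks accumulating somewhere, that should at least keep the process from being OOM-killed while I dig further.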

Is this normal behaviour? (I'm going to try sharing the load with a real load balancer, and most likely add more nodes.)
Thanks, and happy to help debug!

simplescreenrecorder-2022-07-26_15.04.32.mp4

Configuration

No response

Version

0.23.0

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response
