Vector seems to be hanging and not flushing its internal buffer #13718

Open
Sh4d1 opened this issue Jul 26, 2022 · 0 comments
Labels
type: bug A code related bug.

Comments

@Sh4d1
Contributor

Sh4d1 commented Jul 26, 2022

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Hello!

So I'm running Vector in the following setup:

I have 2 VMs behind a keepalived VIP, so only one node receives traffic (normally, see below).

These 2 nodes are running Vector with the following configuration:
[api]
enabled = true
address = "127.0.0.1:8686"

# HTTP server to receive logs from agents
[sources.vector_in]
type = "vector"
address = "0.0.0.0:8700"
version = "2"

# Parse logs received from all the agents
[transforms.vector_parser]
inputs = [ "vector_in", "journald_parser" ]
type   = "remap"
source = ''' # some json parsing
'''

# Read from journald
[sources.journald_in]
type = "journald"
since_now = true
[transforms.journald_filter]
type = "filter"
inputs = [ "journald_in" ]
condition = 'true && some comparisons'

# Parse journald logs
[transforms.journald_parser]
inputs = [ "journald_filter" ]
type   = "remap"
source = '''
.log.ident = del(.SYSLOG_IDENTIFIER)
.log.pid = to_string(del(._PID)) ?? ""
.log.pri = to_syslog_level(to_int(del(.PRIORITY)) ?? 0) ?? ""
.log.@timestamp = del(.timestamp)
.log.message = del(.message)
.log.host = del(.host)

. = .log
'''

[transforms.haproxy_filter]
type = "filter"
inputs = [ "vector_parser" ]
condition = ".ident == \"haproxy\""
[sinks.s3_out]
type = "aws_s3"
inputs = [ "haproxy_filter" ]
[...]
batch.max_bytes = 50_000_000
encoding.codec = "ndjson"
[sinks.s3_out.proxy]
enabled = true
[...]
[sinks.http_out]
type = "http"
inputs = [ "haproxy_filter" ]
[...]
compression = "none"
encoding.codec = "json"

[sinks.elasticsearch_out]
type = "elasticsearch"
inputs = [ "vector_parser" ]
[...]
suppress_type_name = true
mode = "bulk"

# Scrape internal metrics
[sources.metrics_in]
type = "internal_metrics"
scrape_interval_secs = 20
# Expose prometheus metrics
[sinks.metrics_out]
type = "prometheus"
inputs = [ "metrics_in" ]
[...]
default_namespace = "service"

Then there is the whole fleet of servers, also running Vector and sending their logs to the 2 Vector instances behind keepalived.

Here is the relevant and common part:
# Send all logs to vector server
[sinks.vector_out]
type = "vector"
inputs = [ "journald_parser", "syslog_parser"]
address = "VIP:8700"
version = "2"
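Note that vector_out has no explicit buffer section here, so the agents should be on Vector's defaults. Spelled out, my understanding of those defaults is roughly the following (this is an assumption, not something we set explicitly):

[sinks.vector_out.buffer]
type = "memory"        # in-memory buffer (assumed default)
max_events = 500       # assumed default capacity
when_full = "block"    # back-pressure the upstream instead of dropping

So I'd expect the agents to apply back-pressure rather than drop events when the receiving node stalls.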

The setup had been running with Vector 0.21.1 for about a month and a half, and I had pretty flat RAM usage at 1.3-1.4 GB for between 10k and 100k logs/s received by one of the Vector VMs.

A few days ago, in one of these setups, one receiving Vector started to get OOM-killed in a loop (8 GB max). I tried restarting Vector, moving the VIP to the other node, and upgrading to 0.23.0, but still had the same issue.

Obviously it looked dependent on the incoming log load. Since it was the weekend, I didn't really have time to dig deeper. Fast forward to today, when the pattern happened again. I had upgraded to 12 GB of RAM, but the problem persisted. I also added a healthcheck to keepalived that curls the health API with a 2 (and now 10) second timeout.

When RAM usage is high, Vector usually takes more than 10s to answer the healthcheck, triggering a VIP switch. This currently happens every 2 to 10 minutes, possibly leading to ARP cache issues and thus distributing the load between the 2 VMs.

The only thing I saw (see the video below) was that the RAM usage is correlated with the difference between the Events In and Events Out of the vector_in source. I'm not sure how to tackle this to get more information, though. Also (as seen in the video), it looks like sometimes Vector is not doing anything at all (i.e. no events are flowing).
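For reference, my understanding is that each sink keeps its own buffer, and as far as the parts pasted above show, none of the sinks on the receiving nodes declare one, so they should be on the in-memory default as well. If it helps with debugging, here is a sketch of what I could try in order to bound memory on the biggest sink (values are illustrative, not what is currently deployed):

[sinks.elasticsearch_out.buffer]
type = "disk"              # spill to disk instead of RAM (illustrative)
max_size = 1073741824      # ~1 GiB on disk, value picked arbitrarily
when_full = "block"        # back-pressure vector_in when the buffer is full

If the growing RAM really is the gap between vector_in and the sinks accumulating somewhere, that should at least keep the process from being OOM-killed while I dig further.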

Is this normal behaviour? (I'm going to try sharing the load with a real load balancer, and most likely add more nodes.)
Thanks, and happy to help debug!

simplescreenrecorder-2022-07-26_15.04.32.mp4

Configuration

No response

Version

0.23.0

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response
