A note for the community
Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
If you are interested in working on this issue or have submitted a pull request, please leave a comment.
Problem
Hello!
So I'm running Vector in the following setup:
I have 2 VMs behind a keepalived VIP, so only one node receives traffic (normally; see below).
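For reference, the keepalived side is roughly the sketch below (interface, priorities, and the VIP itself are placeholders, not the real values):

vrrp_instance VI_VECTOR {
    state MASTER               # BACKUP on the second VM
    interface eth0             # placeholder interface
    virtual_router_id 51
    priority 150               # lower priority on the second VM
    advert_int 1
    virtual_ipaddress {
        192.0.2.10             # the "VIP" the agents send their logs to
    }
}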
These 2 nodes are running Vector with the following configuration:
[api]
enabled = true
address = "127.0.0.1:8686"

# HTTP server to receive logs from agents
[sources.vector_in]
type = "vector"
address = "0.0.0.0:8700"
version = "2"

# Parse logs received from all the agents
[transforms.vector_parser]
inputs = [ "vector_in", "journald_parser" ]
type = "remap"
source = '''
# some json parsing
'''

# Read from journald
[sources.journald_in]
type = "journald"
since_now = true

[transforms.journald_filter]
type = "filter"
inputs = [ "journald_in" ]
condition = 'true && some comparisons'

# Parse journald logs
[transforms.journald_parser]
inputs = [ "journald_filter" ]
type = "remap"
source = '''
.log.ident = del(.SYSLOG_IDENTIFIER)
.log.pid = to_string(del(._PID)) ?? ""
.log.pri = to_syslog_level(to_int(del(.PRIORITY)) ?? 0) ?? ""
.log.@timestamp = del(.timestamp)
.log.message = del(.message)
.log.host = del(.host)
. = .log
'''

[transforms.haproxy_filter]
type = "filter"
inputs = [ "vector_parser" ]
condition = ".ident == \"haproxy\""

[sinks.s3_out]
type = "aws_s3"
inputs = [ "haproxy_filter" ]
[...]
batch.max_bytes = 50_000_000
encoding.codec = "ndjson"

[sinks.s3_out.proxy]
enabled = true
[...]

[sinks.http_out]
type = "http"
inputs = [ "haproxy_filter" ]
[...]
compression = "none"
encoding.codec = "json"

[sinks.elasticsearch_out]
type = "elasticsearch"
inputs = [ "vector_parser" ]
[...]
suppress_type_name = true
mode = "bulk"

# Scrape internal metrics
[sources.metrics_in]
type = "internal_metrics"
scrape_interval_secs = 20

# Expose prometheus metrics
[sinks.metrics_out]
type = "prometheus"
inputs = [ "metrics_in" ]
[...]
default_namespace = "service"
Then there is the whole fleet of servers, also running Vector and sending their logs to the 2 Vector instances behind keepalived.
Here is the relevant and common part:
# Send all logs to vector server
[sinks.vector_out]
type = "vector"inputs = [ "journald_parser", "syslog_parser"]
address = "VIP:8700"version = "2"
The setup had been running with Vector 0.21.1 for about a month and a half, with a pretty flat RAM usage around 1.3-1.4 GB for between 10k and 100k logs/s received by one of the Vector VMs.
A few days ago, in one of these setups, one receiving Vector started to get OOM-killed in a loop (8 GB max). I tried restarting Vector, switching the VIP to the other node, and upgrading to 0.23.0, but still had the same issue.
It obviously looked dependent on the incoming log load. Since I was away for the weekend, I didn't really have time to dig further. Fast forward to today, when the pattern happened again. I had upgraded the VM to 12 GB of RAM, but it still got killed. I also added a health check to keepalived: a curl against the health API with a 2-second (and now 10-second) timeout.
When the RAM usage is high, Vector usually takes more than 10 seconds to answer the health check, which triggers a VIP switch. This currently happens every 2 to 10 minutes, possibly causing ARP cache issues and thus spreading the load across the 2 VMs.
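The keepalived side of that check is roughly a vrrp_script along these lines (script path and thresholds here are illustrative; the endpoint is the Vector API /health route configured above):

vrrp_script chk_vector {
    script "/usr/bin/curl -sf --max-time 10 http://127.0.0.1:8686/health"
    interval 5                 # run the check every 5 seconds
    fall 2                     # consecutive failures before the node is marked faulty
    rise 2
}
# referenced from the vrrp_instance via:  track_script { chk_vector }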
The only thing I saw (see the video below) was that the RAM usage is related to the difference between the Events In and Events Out counts of the vector_in source. I'm not sure how to tackle the issue to get more information, though. Also (as seen in the video), it looks like sometimes Vector is not doing anything (i.e. no events).
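One way to put a number on that gap would be a query against the prometheus sink, assuming the internal metrics expose component_received_events_total / component_sent_events_total counters for the source (the exact metric and label names depend on the Vector version, so check what the exporter actually serves):

# Approximate backlog growth rate (events/s) of the vector_in source.
# Metric and label names below are assumptions; verify against the exporter output.
sum(rate(vector_component_received_events_total{component_id="vector_in"}[5m]))
-
sum(rate(vector_component_sent_events_total{component_id="vector_in"}[5m]))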
Is this normal behaviour? (I'm going to try to share the load with a real load balancer, and most likely add more nodes.)
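As a sketch of what "a real load balancer" could look like here, a plain TCP pass-through in front of port 8700 (server addresses are placeholders):

frontend vector_in
    bind *:8700
    mode tcp
    default_backend vector_nodes

backend vector_nodes
    mode tcp
    balance roundrobin
    server vector1 10.0.0.11:8700 check    # placeholder addresses
    server vector2 10.0.0.12:8700 check

One caveat: version 2 of the vector source/sink speaks gRPC over long-lived HTTP/2 connections, so a TCP balancer will pin each agent to a single backend for the lifetime of its connection.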
Thanks and happy to help and debug!
simplescreenrecorder-2022-07-26_15.04.32.mp4
Configuration
No response
Version
0.23.0
Debug Output
No response
Example Data
No response
Additional Context
No response
References
No response