Kapacitor #9
Evaluation

We must answer these questions to determine whether Kapacitor is a viable option for the CEP in our v2 architecture:

Scaling
Alarm language
Alarming functionalities
Notification methods
Data generation
Interaction methods
Esper
Option 1: Kafka Source -> Influxdb & Kapacitor

We first store metrics in Kafka, and Kapacitor streams directly from Kafka. In this case we could partition on account (but use more than the 64 partitions we have right now). This is what Kapacitor would stream from, but we would have something in between Kafka and Influx to partition it further. This means Kapacitor tasks would be able to handle cross metric/check alarms for a single account with ease.

However, if we had to generate new metrics and alert on them, we might have to push them back into the original Kafka topic that we read from. That is not ideal, especially since there isn't currently a kafkaOut node like there is an httpOut and influxDBOut. Although, I'm not sure the task would even need to read those back from the feed, as it might hold all its "windows" in memory when calculating averages over time.

For any cross-account checks we would have to consume from influxdb. We would probably configure tasks on each account that post average metrics, rather than performing these tasks on the raw data.

Assumptions for this to work
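As a concrete illustration of the Option 1 flow described above, a per-account averaging task might look like the following TICKscript sketch. The database, measurement, and field names are invented for illustration and are not our real schema; note that derived metrics have to go out via influxDBOut (or httpOut), since no kafkaOut node exists.

```tickscript
// Hypothetical stream task for one account's metrics.
// 'account_123', 'cpu', and 'usage' are assumptions, not our real schema.
stream
    |from()
        .database('account_123')
        .measurement('cpu')
        .groupBy('check_id')
    |window()
        .period(5m)
        .every(1m)
    |mean('usage')
        .as('avg_usage')
    // There is no kafkaOut node, so derived metrics would have to be
    // written to Influx (or an HTTP endpoint) rather than back into Kafka.
    |influxDBOut()
        .database('aggregated')
        .measurement('account_avgs')
```

The window is held in Kapacitor's memory, which supports the suspicion above that the task may not need to read its own output back from the feed.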
Option 2: Kafka -> Influxdb -> Kapacitor

Kafka will only act as a buffer before the data enters influxdb. It may also feed data into the warehouser. Kapacitor will read directly from influx, which is more "natural" for that ecosystem. In this case we would partition the influxdb instances.

For cross-checkType alarms, we would add a task to the relevant db instance that filters out what is needed for the alarm and forwards that to another db specific for "aggregated" alarms. The task with the main alarm logic would only query the aggregated db instance.

Assumptions for this to work / potential problems
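The forwarding task described for Option 2 might be sketched like this. The database and measurement names are invented, and .cluster() assumes both influxdb instances are defined as named clusters in the Kapacitor config.

```tickscript
// Hypothetical task on a per-partition influxdb instance that forwards
// only the series needed for a cross-checkType alarm to an "aggregated" db.
// All names here are made up for illustration.
stream
    |from()
        .database('metrics_shard_1')
        .measurement('http_response')
        .where(lambda: "check_type" == 'remote.http')
    |influxDBOut()
        .cluster('aggregated-influx')
        .database('aggregated_alarms')
        .measurement('http_response')
```

The main alarm task would then read only from the aggregated instance, keeping the raw-data instances out of the cross-checkType path.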
Example: Cross Influxdb Task

We have two influxdb hosts, one telegraf configured to write to both, and Kapacitor configured to use both influxdb hosts.

A task can be created to combine both metrics into one alarm.

In this case we would likely need to make use of something like Kapacitor's join functionality. As far as I'm aware you cannot specifically query a certain host within a task, only a certain database; if you do not specify the database in the task, a default is used. If we define database names per host, a task can target each host's data.
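A sketch of what combining the two sources into one alarm might look like using join(). The database, measurement, and field names are assumptions, as are the thresholds:

```tickscript
// Combine two metrics from differently named databases into one alarm.
// 'metrics_host_a'/'metrics_host_b' stand in for per-host database names.
var cpu = stream
    |from()
        .database('metrics_host_a')
        .measurement('cpu')

var mem = stream
    |from()
        .database('metrics_host_b')
        .measurement('mem')

cpu
    |join(mem)
        .as('cpu', 'mem')
        // Allow points whose timestamps differ slightly to still be paired.
        .tolerance(10s)
    |alert()
        .crit(lambda: "cpu.usage_user" > 90 AND "mem.used_percent" > 90)
        .log('/tmp/cross_db_alert.log')
```

Note the join is on databases, not hosts, which is why per-host database naming matters here.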
What is it and how does it work?
Open source framework for processing, monitoring, and alerting on time series data
https://github.com/influxdata/kapacitor
https://docs.influxdata.com/kapacitor/v1.3/
Configuration
Interaction methods
curl -v kapacitor:9092/kapacitor/v1/:routes
The kapacitor CLI, which uses the API

Data sources
If we don't commit to using influx we probably wouldn't want to use Kapacitor. Influx will always be the optimal data source.
Tasks (what they call alarms)
Alerting

The .alert node supports consecutiveCount, which is comparable to RBA's bake time. There is also StateDuration, which defaults to 0 but that users can specify themselves to extend the default. Alternatively, we can perform validation in the api layer and use one value - that may be saner.

Failure Scenarios
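The duration-based behaviour described in the Alerting notes above might be expressed like this; the measurement, field, and thresholds are invented for illustration:

```tickscript
// Sketch: require a condition to hold for 5 minutes before going critical,
// similar in spirit to RBA's bake time. Names and thresholds are assumptions.
stream
    |from()
        .measurement('cpu')
    |stateDuration(lambda: "usage_idle" < 10)
        .unit(1m)
        .as('low_idle_minutes')
    |alert()
        .crit(lambda: "low_idle_minutes" >= 5)
```

If we validate in the api layer instead, the duration would be a single fixed value baked into the generated script.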
Debugging

The kapacitor show TASK_NAME command outputs performance stats that you can use to determine how well a given task might scale. In general, Kapacitor consumes CPU and RAM more than disk.

Anomaly detection / Predictive modeling
Backups
Not sure if we'd use this since we'd probably want to reload tasks from Cassandra.
Known problems for us
The system plugin sends both load and uptime stats, but it sends load on one line and uptime on another. This ends up as one line in Influx since the timestamps match, but Kapacitor receives each line individually for stream tasks; batch tasks would not be affected.

isPresent(): how do we handle cases where an expected metric never arrives? Currently our alarms return critical if the criteria tests a metric key that doesn't exist. We can't simply use default,
because we don't want to pre-populate a metric if it will arrive shortly, as that could cause the alarm to change to an undesired state; we only want to know if it never arrives.

Clustering
Without paying for the enterprise version, influxdb will not have clustering abilities. I performed a small test to see how telegraf/kapacitor handle a scenario with multiple dbs of the same name.
One telegraf instance was configured to send to two influxdb hosts. The write consistency is set to "any" so it only ends up writing to one host at a time.
A streaming task was enabled in kapacitor to log the data it saw incoming:
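A minimal task along those lines might look like the following; the database and measurement names are assumptions:

```tickscript
// Minimal stream task that logs every incoming point it sees.
// 'telegraf'/'cpu' are assumed names for the test setup.
stream
    |from()
        .database('telegraf')
        .measurement('cpu')
    |log()
```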
This would log a message on every telegraf output regardless of the db host it ended up in.
A batch task was then enabled to query influx for the count of how many metrics it had seen.
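Such a counting batch task might be sketched as follows; the query, field, and intervals are assumptions for the test setup:

```tickscript
// Batch task that periodically counts points over the last window and logs
// the result. All names and intervals here are invented for illustration.
batch
    |query('SELECT count(usage_idle) FROM "telegraf"."autogen"."cpu"')
        .period(10m)
        .every(1m)
    |log()
```

Because each run of the query can land on a different host, the counts diverge, which is the failure described below.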
This task did not work as desired. On each run of the query it would use a different host and therefore return different results each time. The final count returned was about half of the expected value.
TO DO
Determine whether streaming tasks are handled entirely in memory or whether Kapacitor has to query influx after new data has been received. i.e. do these tasks only get fed data from the stream, or can they also reach back into influx when performing more complex evaluations?
Problem
Without clustering, we are limited in how much we can utilize kapacitor tasks. It also means we have to develop our own methods for replication and failover, since we cannot configure clients to read/write with quorum consistency.
Other things to be aware of
01:00:00.000, 01:01:00.000, 01:02:00.000, etc.
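If these even timestamps refer to batch query scheduling, the query node's align() property is what snaps query boundaries to the every() interval. The query and intervals below are an invented example:

```tickscript
// Batch queries fire on even interval boundaries when align() is set,
// e.g. 01:00:00.000, 01:01:00.000, ... Names here are assumptions.
batch
    |query('SELECT mean(usage_idle) FROM "telegraf"."autogen"."cpu"')
        .period(1m)
        .every(1m)
        .align()
    |log()
```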