Simple TICKscript that will always error #1217
I think the easiest solution here will be to fix Telegraf, which is really sending one point on two separate lines. In my opinion, this type of formatting is also technically not valid line protocol, from the docs:
"Where a single point should be represented by a single line." The only thing preventing the first point from being overwritten is the fact that it's using different field keys. Trying to get InfluxDB and/or Kapacitor to account for this edge case would take a lot of effort to handle reliably.
I agree that Telegraf is the easiest place to fix the issue, but as @sparrc notes, there are multiple reasons why a metric may be split.
I think there's a larger question here about how and where we should talk about these kinds of issues that have cross-platform implications.
I see that the documentation seems to state that a point should be a single line, but InfluxDB accepts metrics that are split across multiple lines/inserts. Breaking this behavior would be a huge change. For example, this works in InfluxDB (and is how Telegraf expects InfluxDB to work):
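The accepting-split-writes behavior can be sketched with a toy write path. This is hypothetical illustration code, not InfluxDB's actual implementation; the measurement, field names, and timestamp are invented, and the parser handles no tags or escaping.

```python
# Toy sketch of InfluxDB's implicit merge-on-write: two line-protocol lines
# with the same series and timestamp end up stored as one point.
# Hypothetical code -- a minimal parser with no tags and no escaping.

def parse_line(line):
    """Parse 'measurement field=value[,field=value...] timestamp'."""
    measurement, fields, ts = line.split(" ")
    parsed = {}
    for pair in fields.split(","):
        key, value = pair.split("=")
        parsed[key] = float(value)
    return measurement, parsed, int(ts)

def write(store, line):
    """Merge the line's fields into any existing point for the same
    (measurement, timestamp), mimicking the merge behavior described above."""
    measurement, fields, ts = parse_line(line)
    store.setdefault((measurement, ts), {}).update(fields)

store = {}
# A Telegraf-style split: one logical point sent as two separate lines.
write(store, "system load1=0.5,load5=0.4 1490000000")
write(store, "system uptime=3600 1490000000")
# The store now holds a single point carrying all three fields.
```

The key detail is that the merge happens on (series, timestamp), which is why only differing field keys keep the first write from being overwritten.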
I believe I stated in my original report that Kapacitor batch processing works properly, as the query is offloaded to InfluxDB. However, stream nodes will fail. I understand perfectly well the reasoning for such a split. For example, today I witnessed the same behavior with the Ceph input plugin, which produces a great deal more metrics. Forwarding too many fields on a single line becomes infeasible at some point. Nevertheless, this issue is very confusing from a user standpoint. Stream is the recommended operating mode for Kapacitor, and most documentation examples are built upon it. One would assume that a TICKscript which works fine with the cpu measurement should yield a similar result when modified for system, but this simply isn't the case. It is also likely to cause problems for people who try to use TICKscript templates and variables.
I do have two guesses at a solution, but first I think it is important to frame the problem in terms of how Kapacitor and InfluxDB define "points". InfluxDB's behavior is to implicitly merge points that have the same timestamp. This is a result of how the data is stored on disk. Kapacitor defines a point based on the definition given in the protocol:
In other words, Kapacitor defines a point as any number of fields plus a timestamp that arrived on the same line. So here are my two suggestions:
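The consequence of the per-line definition can be made concrete with a toy sketch (hypothetical code; the field names and expression are invented): a split write becomes two points, and an expression that needs both fields fails on each of them, while the same data merged per InfluxDB's definition evaluates fine.

```python
# Hypothetical sketch: under Kapacitor's per-line point definition, a split
# write yields two points, and an eval over both fields fails on each one.

def eval_usage(point):
    """Evaluate a made-up expression requiring two fields on one point."""
    return point["load1"] / point["n_cpus"]

line1 = {"load1": 0.5}    # first line of the split write
line2 = {"n_cpus": 4.0}   # second line, same series and timestamp

errors = 0
for point in (line1, line2):
    try:
        eval_usage(point)
    except KeyError:
        errors += 1  # a required field is missing on this point

# Merged per InfluxDB's definition, the same data evaluates without error.
merged = {**line1, **line2}
```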
Unfortunately this is not true. If we assumed this to be true it would work most of the time, but it would break in two scenarios that I can think of. Telegraf will send a batch whenever its internal buffer fills up. If the user has a large number of metrics passing through Telegraf, then it will fill frequently, and metrics could have the same timestamp across many batches. The other way this would break is one I already mentioned: UDP packets. We would need to change the behavior of Telegraf to reject sending UDP packets over a certain size rather than splitting them up into multiple points. I'm not sure what the best solution for this is. Seems like we might want to make a separate output plugin for Kapacitor that would do its own buffering, holding onto multiple batches until it sees a metric with a new timestamp.
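The "hold batches until a metric with a new timestamp arrives" idea could be sketched roughly like this. This is hypothetical code, not an actual Telegraf plugin, and it assumes timestamps from the source are non-decreasing.

```python
# Hypothetical sketch of a timestamp-aware output buffer: fields are merged
# until a point with a newer timestamp proves the buffered timestamp is
# complete, at which time the merged point is flushed.
# Assumes non-decreasing timestamps from the source.

class TimestampBuffer:
    def __init__(self):
        self.ts = None
        self.fields = {}

    def add(self, ts, fields):
        """Buffer fields; return a flushed (ts, fields) point or None."""
        flushed = None
        if self.ts is not None and ts > self.ts:
            # A newer timestamp arrived: the buffered point is complete.
            flushed = (self.ts, self.fields)
            self.fields = {}
        if self.ts is None or ts > self.ts:
            self.ts = ts
        self.fields.update(fields)
        return flushed

buf = TimestampBuffer()
first = buf.add(1490000000, {"load1": 0.5})      # buffered, nothing flushed
second = buf.add(1490000000, {"uptime": 3600})   # merged, same timestamp
third = buf.add(1490000010, {"load1": 0.6})      # newer ts flushes the merge
```

A real version would have to buffer per series (measurement plus tag set), not globally as this sketch does.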
I personally like the idea of the stream node buffering points.
@phemmer Interesting. I would have expected the cost of buffering to be too high. If you want to buffer and reorder points, how long are you willing to buffer the data, and at what penalty to the latency of your data?
IMO buffering data like this should not be the default functionality. It should be opt-in; we could add a
I would think it should be configurable, that way each user can tailor it to their use case. Something like:
There are 2 main cases I can think of where buffering would be very useful for fixing out-of-order points.
I think the main reason this isn't more of a problem is that most people group by host, so within the same host the points are in order. But if you want to operate on data across hosts, then you're in a major pickle. We've tried to do this and have had to change designs (sometimes by not using Kapacitor) because of the issue.
@phemmer Thanks, this is great feedback!
Some additional thoughts on the buffer idea. The buffering should be done via a node, and not part of
Also, nodes across different TICKscripts that are configured the same way, with parents that are also configured the same way, can be shared. This way N scripts buffering the same points don't use N times the memory.
Should use the former. If you have a buffer time of 10 seconds and you somehow receive a point timestamped a year in the future, everything immediately becomes unbuffered, because everything is more than 10 seconds before this errant point. And the errant point will never leave the buffer (or at least not until a year goes by), because nothing is newer than it to flush it out.
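The argument above can be sketched in hypothetical code: if expiry is based on each point's receive time (wall clock) rather than on the newest point timestamp seen, an errant future-dated point neither flushes everything else nor gets stuck forever.

```python
# Hypothetical sketch of a receive-time (wall-clock) buffer: every point is
# held for a fixed `hold` after it was RECEIVED, so a point timestamped far
# in the future expires on the same schedule as any other point.

class ReceiveTimeBuffer:
    def __init__(self, hold):
        self.hold = hold
        self.held = []  # list of (point_ts, fields, received_at)

    def add(self, point_ts, fields, now):
        self.held.append((point_ts, fields, now))

    def flush(self, now):
        """Emit, sorted by point timestamp, every point held >= `hold`."""
        ready = [(ts, f) for ts, f, rcvd in self.held
                 if now - rcvd >= self.hold]
        self.held = [e for e in self.held if now - e[2] < self.hold]
        return sorted(ready, key=lambda p: p[0])

buf = ReceiveTimeBuffer(hold=10)
buf.add(100, {"a": 1}, now=0)
buf.add(32000000000, {"b": 2}, now=1)   # errant point dated ~year 2984
buf.add(101, {"c": 3}, now=2)

early = buf.flush(now=5)    # nothing has been held long enough yet
later = buf.flush(now=11)   # points received at t=0 and t=1 expire normally
```

Note the errant point still leaves the buffer after 10 seconds, and the point received at t=2 remains buffered; nothing is prematurely flushed or stuck.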
Is there any way to work around this until it's fixed? |
@austinbutler The best workaround I can currently think of is to use a batch task instead of a stream task.
Just got another report of a similar issue with the
In the meantime, for
This may need a bit more tweaking to make it work, but the general idea is there.
@markuskont @austinbutler ^^^^
@desa I agree with the points sparrc made; to me it seems any real fix must be done in Kapacitor. However, if we are adding metrics separately in Telegraf, we should probably change it. It looks like the mysql plugin does indeed do this. Can you open an issue?
@desa I worry that would just skip some points, is that not the case? In other words, it's not just making the error go away, is it? Will it wait to do the eval until both fields come in?
@phemmer What is the difference between using a window and a buffer? Aren't they doing more or less the same thing, namely capturing data received over some period of time?
@ss682 The window node would have to hold at least one point in its buffer at all times, and possibly more depending on how out-of-order your points may be (this sizing would have to be hard-coded in the script, which may not work in dynamic environments). This means that your data will be delayed by at least the interval between points. Even with a minimal window of 1 point, Kapacitor already uses point buffering on a lot of different nodes, so once you've chained several of them together, your data is horribly delayed. This is especially bad when your data comes in infrequently. The proposed buffer node would be wall-clock-time based, meaning that instead of holding N points it holds X seconds of points. The duration would likely be in the millisecond range, and the impact would be minimal. I also don't believe a window node does point re-ordering, but I'm not sure on this, and it could be adjusted to do so if it does not.
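The latency difference can be illustrated with a small hypothetical sketch (the arrival times and hold duration are invented): a 1-point window can only emit a point when its successor arrives, so its delay equals the inter-point interval, while a wall-clock buffer's delay is a fixed hold time.

```python
# Hypothetical sketch contrasting the two delay models described above.

def window_emit_times(arrivals):
    """1-point window: point i is emitted only when point i+1 arrives."""
    return arrivals[1:]

def buffer_emit_times(arrivals, hold):
    """Wall-clock buffer: each point is emitted a fixed `hold` after arrival."""
    return [t + hold for t in arrivals]

arrivals = [0, 60, 120]  # infrequent data: one point per minute

window = window_emit_times(arrivals)       # 60 s of latency per point
buffered = buffer_emit_times(arrivals, 1)  # 1 s of latency per point
```

With infrequent data the window's latency grows with the reporting interval, while the buffer's stays constant, which is the core of the argument above.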
Duplicate of influxdata/telegraf#2444
Since the data for a series is split into two lines, InfluxDB forwards it on to Kapacitor as two lines, and therefore it's possible to write a TICKscript that will perpetually fail
with
will perpetually error out, even though all of the data would exist.
As I see it, there are four ways to solve this problem.
I had initially directed @markuskont to open an issue on Telegraf, but there's definitely more than one way to solve this issue.