Invalid measurements found on Chronograf hours after setting up #3157
Hi @sebastianarena, thanks for writing in! Yikes, there must be some sort of line protocol issue when telegraf sends data to influx. I'd also like to see a bit of the output from the telegraf plugins. To get some data out, would you run this two or three times and send the output: telegraf --config telegraf.conf -test
There is a bug that looks similar to this one and has not yet been isolated. It seems to occur when using HTTP/2 and nginx (#2854). Any chance you are also using HTTP/2 and nginx?
Hi! YES! I'm using NGINX and HTTP2 as default!
Just did a couple of tests.
Have a look:
Here's my telegraf.conf:
Here's a sample output like @goller wanted, specifically from the server that seems to be showing more problems in the logs:
On the failing host, can you try to isolate the error to a single plugin?
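One way to narrow it down is to run the test mode with only one input enabled at a time. The `--input-filter` flag below is taken from telegraf's CLI help; treat the exact plugin names as assumptions about this setup:

```
# Test only the mongodb input, then repeat for redis, cpu, etc.
telegraf --config telegraf.conf --input-filter mongodb -test
telegraf --config telegraf.conf --input-filter redis -test
```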
It is possible that #3142 is related.
@sebastianarena #3142 is fixed. I think it is a long shot, but can you test with 1.4.0-rc2 to see if there is any improvement?
Hello all.

Regarding the host that causes trouble: it was all of them. I have centralized all our system logs, and I could clearly see that every server got an HTTP 400 from Influx at some moment, which leads to the invalid measurements.

Regarding the exact plugin causing the issue: I only have the basic stuff, plus Mongo and Redis. I suspected those might be the trouble, but decided to try something else instead. I shut down Telegraf on all remote servers and left just the TICK stack on one server, with one more server remotely pushing measurements. So far, 1h after that, everything is looking OK. That seems to indicate the problem is not in Telegraf but in Influx itself, which gets overloaded when multiple servers send their data in at the same time. Which worries me even more.

I'll leave this test running for the time being, and will add hosts one at a time until it crashes again.
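For reference, a minimal telegraf.conf matching the setup described above ("the basic stuff, plus Mongo and Redis") might look like the sketch below; the URLs and database name are hypothetical, since the actual config was not shown in full:

```toml
# Hedged sketch, not the reporter's actual config.
[agent]
  interval = "10s"
  flush_interval = "10s"

[[outputs.influxdb]]
  urls = ["https://influx.example.com:8086"]  # hypothetical endpoint behind nginx
  database = "telegraf"

# "Basic stuff":
[[inputs.cpu]]
[[inputs.mem]]
[[inputs.disk]]

[[inputs.mongodb]]
  servers = ["mongodb://127.0.0.1:27017"]

[[inputs.redis]]
  servers = ["tcp://localhost:6379"]
```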
UDP should be good for performance but is obviously less reliable. In 1.4 we are also adding gzip support, which should be an improvement if you are sending over a slow connection. From what I have read, nginx only does HTTP/2 on the client side while the proxy side is HTTP/1.1, so this must be a bug in either Telegraf or nginx, since I haven't had any reports of this happening in HTTP/1.1 mode.
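The two alternatives mentioned above could look roughly like this in the influxdb output section (hostnames and ports are hypothetical, and the `content_encoding` option name is an assumption about the gzip support being discussed):

```toml
# (a) UDP: fast but lossy, and bypasses nginx/HTTP entirely.
[[outputs.influxdb]]
  urls = ["udp://influx.example.com:8089"]

# (b) HTTP with gzip-compressed request bodies (per the 1.4 comment above):
# [[outputs.influxdb]]
#   urls = ["https://influx.example.com:8086"]
#   content_encoding = "gzip"
```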
I don't think nginx is involved, because with a single host everything works fine. In the end, I spent the last couple of hours playing with the interval, flush, and jitter settings. I slowed down the intake of inputs, and now everything seems to be working fine. Something is wrong somewhere, and gzip would be an awesome addition to speed things up. For now, I think I'm OK.
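The kind of tuning described above can be sketched in the `[agent]` section; the exact values here are illustrative, not the reporter's settings:

```toml
# Slower collection plus jitter, so many hosts don't all flush
# to InfluxDB at the same instant.
[agent]
  interval = "30s"            # collect less often
  collection_jitter = "5s"    # stagger collection start times
  flush_interval = "30s"
  flush_jitter = "10s"        # stagger flushes across hosts
```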
Let's merge this issue with #2854.
Hello team.
I set up a TICK stack over a week ago, with only one server tracking everything, to play with the platform. I loved it, and decided to install Telegraf on each of the multiple servers I manage. Each server was configured to collect the basic Telegraf metrics, plus the mongo and redis inputs.
Leaving that to work over the weekend, I came back to a lot of invalid measurements in Chronograf.
This is clearly 127.0.0.1:27017 reporting from mongo, all cut up into pieces for some reason.
I wiped the telegraf database, and started from scratch.
After several hours, the problem happens again.
To my recollection, it seems that measurements are being "cut off" at some point, and from then on a lot of invalid data is generated that InfluxDB fails to parse correctly but somehow stores anyway.
I found log entries stating that the data could not be parsed correctly; they show truncated data of some sort.
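What a mid-line cut does to line protocol can be sketched as follows. This is illustrative Python, not Telegraf or InfluxDB code, and the point below is a plausible mongodb example rather than the actual payload; but it shows how the tail of a truncated write can be misread as a fresh point whose "measurement" is a fragment like 127.0.0.1:27017:

```python
# Illustrative sketch only: why a request body truncated mid-line can
# produce bogus measurement names such as "127.0.0.1:27017".

def measurement_name(line: str) -> str:
    """Measurement = everything before the first unescaped ',' or ' '."""
    for i, ch in enumerate(line):
        if ch in ", " and (i == 0 or line[i - 1] != "\\"):
            return line[:i]
    return line

# One healthy line-protocol point from inputs.mongodb:
point = "mongodb,hostname=127.0.0.1:27017 open_connections=10i 1504000000000000000"
print(measurement_name(point))            # -> mongodb

# If the write is cut right after "mongodb,hostname=", the leftover tail
# may later be parsed as the start of a new line:
tail = point[len("mongodb,hostname="):]
print(measurement_name(tail))             # -> 127.0.0.1:27017
```

Under that reading, each cut lands in a different spot, which would explain the many different fragment-shaped measurement names in Chronograf.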
Is there anything in the configuration I should do to prevent this from happening? Is this a bug?
Thanks!