Invalid measurements found on Chronograf hours after setting up #3157

Closed
sebastianarena opened this issue Aug 23, 2017 · 11 comments
Labels: area/influxdb, bug (unexpected problem or unintended behavior)

sebastianarena commented Aug 23, 2017

Hello team.

I set up a TICK stack a little over a week ago, using a single server to track everything and to play with the platform. I loved it, so I decided to install Telegraf on each of the servers I manage. Each server was configured to collect the basic Telegraf metrics plus the mongodb and redis inputs.

After leaving that running over the weekend, I came back to a lot of invalid measurements in Chronograf.

[Screenshots: 2017-08-22 at 23:46:08, 21:04:39, and 23:51:50, showing the invalid measurements in Chronograf]

This is clearly 127.0.0.1:27017 reporting from the mongodb input, cut up into pieces for some reason.

I wiped the telegraf database and started from scratch.
After several hours, the problem happened again.

From what I can tell, measurements are being "cut off" at some point, and from then on a lot of invalid data is generated that InfluxDB cannot parse correctly but somehow stores anyway.

I found log entries stating that the data could not be parsed correctly; they show truncated data of some sort.

Is there anything I should change in the configuration to prevent this from happening, or is this a bug?

Thanks!

2017-08-23T02:01:40Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: unable to parse 'objects=1185290i,indexes=51i,type="db_stat",data_size=616619670i,index_size=87994368i,collections=14i,avg_obj_size=520.2268390014258,ok=1i,storage_size=207269888i,num_extents=0i 1503453700000000000': invalid field format dropped=0]
2017-08-23T02:03:20Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: unable to parse '29224960i,indexes=37i,ok=1i,index_size=24461312i 1503453800000000000': invalid field format dropped=0]
2017-08-23T02:03:50Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: unable to parse 'e_size=29224960i,num_extents=0i,index_size=24461312i,collections=9i,indexes=37i,type="db_stat" 1503453830000000000': invalid field format dropped=0]
2017-08-23T02:04:30Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: unable to parse '=520.2268702174151,storage_size=207269888i,num_extents=0i,type="db_stat",data_size=616619707i,indexes=51i,ok=1i 1503453870000000000': invalid field format dropped=0]
2017-08-23T02:09:50Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: unable to parse 'orage_size=29224960i,num_extents=0i,index_size=24461312i,collections=9i,ok=1i 1503454190000000000': invalid field format dropped=0]
2017-08-23T02:10:20Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: unable to parse 'tat" 1503454220000000000': invalid field format dropped=0]
2017-08-23T02:13:10Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: unable to parse 'ize=36864i,index_size=229376i,objects=2i,ok=1i 1503454390000000000': invalid field format dropped=0]
2017-08-23T02:16:10Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: unable to parse '87733908,num_extents=0i,data_size=54886210i,storage_size=29224960i,indexes=37i,ok=1i,index_size=24461312i,collections=9i,objects=201254i 1503454570000000000': invalid field format dropped=0]
goller (Contributor) commented Aug 23, 2017

Hi @sebastianarena, thanks for writing in!

Yikes, there must be some sort of line protocol issue when Telegraf sends data to InfluxDB.
Do you mind sending the telegraf.conf?

Also, I'd like to see a bit of the output from the Telegraf plugins. To get some data out, would you run this two or three times and send the output:

telegraf --config telegraf.conf -test

danielnelson (Contributor) commented:

There is a bug that looks similar to this which has not yet been isolated. It seems to occur when using HTTP/2 and nginx: #2854

Any chance you are also using HTTP/2 and nginx?
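
If you are, disabling HTTP/2 is usually just a matter of dropping the http2 flag from nginx's listen directive. A rough sketch of a typical reverse-proxy server block (the server name and ports are purely illustrative):

server {
    # listen 443 ssl http2;            # HTTP/2 enabled
    listen 443 ssl;                    # HTTP/2 disabled while testing
    server_name influx.example.com;

    location / {
        proxy_pass http://127.0.0.1:8086;  # InfluxDB's HTTP listener
        proxy_http_version 1.1;            # upstream side stays HTTP/1.1
    }
}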

sebastianarena (Author) commented:

Hi! YES! I'm using NGINX with HTTP2 enabled by default!
Let me try removing HTTP2 and check whether it solves the problem.

sebastianarena (Author) commented:

Just did a couple of tests.
It seems to have helped, but not all the way.
At least once a minute, one server reports an HTTP 400 error, and each time an invalid measurement is created.
For the record:

  • I'm still using NGINX to proxy port 80 to the internal port
  • I'm using HTTPS on NGINX
  • I'm NOT using HTTP2 anymore, per @danielnelson's suggestion

Maybe some part of this setup is still problematic.

Have a look:

2017-08-23T21:03:30Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: unable to parse '39287i,interrupts=341323616i,context_switches=560217853i,boot_time=1502293697i 1503522210000000000': invalid field format dropped=0]

[Screenshot: 2017-08-23 at 18:05:32, showing the resulting invalid measurement]

Here's my telegraf.conf:

[global_tags]
[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  debug = false
  quiet = false
  logfile = ""
  hostname = ""
  omit_hostname = false
[[outputs.influxdb]]
  urls = ["<redacted>"] # required
  database = "telegraf" # required
  retention_policy = ""
  write_consistency = "any"
  timeout = "5s"
  username = "<redacted>"
  password = "<redacted>"
[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "devfs"]
[[inputs.diskio]]
[[inputs.kernel]]
[[inputs.mem]]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]
[[inputs.internal]]
   collect_memstats = true
[[inputs.mongodb]]
   servers = ["127.0.0.1:27017"]
   gather_perdb_stats = true
[[inputs.redis]]
   servers = ["tcp://localhost:6379"]

Here's the sample output @goller asked for, specifically from the server that shows the most problems in the logs:

2017/08/23 21:08:42 I! Using config file: /etc/telegraf/telegraf.conf
* Plugin: inputs.mongodb, Collection 1
2017-08-23T21:08:42Z W! [inputs.mongodb] Using "mongodb://127.0.0.1:27017" as connection URL; please update your configuration to use an URL
* Plugin: inputs.mongodb, Collection 2
> mongodb,hostname=127.0.0.1:27017,host=redacted net_out_bytes=24025i,deletes_per_sec=0i,vsize_megabytes=1119i,percent_cache_used=5.3,getmores_per_sec=0i,active_writes=0i,queued_reads=0i,ttl_deletes_per_sec=0i,active_reads=0i,commands_per_sec=7i,updates_per_sec=0i,ttl_passes_per_sec=0i,resident_megabytes=193i,flushes_per_sec=0i,queries_per_sec=0i,queued_writes=0i,open_connections=8i,inserts_per_sec=0i,jumbo_chunks=0i,percent_cache_dirty=0,net_in_bytes=472i 1503522523000000000
> mongodb_db_stats,host=redacted,db_name=admin,hostname=127.0.0.1:27017 ok=1i,objects=1i,type="db_stat",num_extents=0i,avg_obj_size=59,data_size=59i,indexes=2i,index_size=32768i,storage_size=16384i,collections=1i 1503522523000000000
> mongodb_db_stats,hostname=127.0.0.1:27017,host=redacted,db_name=redacted collections=7i,objects=3295i,data_size=29828269i,num_extents=0i,ok=1i,index_size=569344i,type="db_stat",avg_obj_size=9052.585432473445,storage_size=20119552i,indexes=28i 1503522523000000000
> mongodb_db_stats,hostname=127.0.0.1:27017,host=redacted,db_name=local type="db_stat",collections=1i,storage_size=16384i,ok=1i,index_size=16384i,objects=1i,avg_obj_size=1657,data_size=1657i,num_extents=0i,indexes=1i 1503522523000000000
* Plugin: inputs.cpu, Collection 1
* Plugin: inputs.cpu, Collection 2
> cpu,host=redacted,cpu=cpu0 usage_softirq=0,usage_user=0,usage_system=0,usage_nice=0,usage_iowait=0,usage_irq=0,usage_idle=100,usage_steal=0,usage_guest=0,usage_guest_nice=0 1503522524000000000
> cpu,cpu=cpu1,host=redacted usage_irq=0,usage_guest_nice=0,usage_nice=0,usage_system=0,usage_idle=100,usage_iowait=0,usage_softirq=0,usage_steal=0,usage_guest=0,usage_user=0 1503522524000000000
> cpu,cpu=cpu-total,host=redacted usage_system=0,usage_idle=100,usage_irq=0,usage_steal=0,usage_guest=0,usage_user=0,usage_nice=0,usage_iowait=0,usage_softirq=0,usage_guest_nice=0 1503522524000000000
* Plugin: inputs.kernel, Collection 1
> kernel,host=redacted interrupts=341422671i,context_switches=560370475i,boot_time=1502293697i,processes_forked=1039626i 1503522524000000000
* Plugin: inputs.mem, Collection 1
> mem,host=redacted buffered=83603456i,inactive=1447612416i,used_percent=31.371947721004585,available=2841600000i,free=704352256i,cached=2127491072i,available_percent=68.62805227899541,total=4140580864i,used=1298980864i,active=1696927744i 1503522524000000000
* Plugin: inputs.processes, Collection 1
> processes,host=redacted sleeping=260i,total=260i,unknown=0i,total_threads=540i,running=0i,zombies=0i,stopped=0i,dead=0i,paging=0i,blocked=0i 1503522524000000000
* Plugin: inputs.swap, Collection 1
> swap,host=redacted used=0i,free=536866816i,used_percent=0,total=536866816i 1503522524000000000
> swap,host=redacted in=0i,out=0i 1503522524000000000
* Plugin: inputs.system, Collection 1
> system,host=redacted load1=0.01,load5=0.04,load15=0.05,n_users=1i,n_cpus=2i 1503522524000000000
> system,host=redacted uptime=1228826i,uptime_format="14 days,  5:20" 1503522524000000000
* Plugin: inputs.internal, Collection 1
> internal_memstats,host=redacted sys_bytes=12232952i,mallocs=57067i,heap_sys_bytes=9011200i,alloc_bytes=3825240i,heap_objects=13631i,num_gc=3i,heap_alloc_bytes=3825240i,heap_idle_bytes=3883008i,heap_in_use_bytes=5128192i,heap_released_bytes=0i,frees=43436i,pointer_lookups=1182i,total_alloc_bytes=9563888i 1503522524000000000
> internal_gather,input=internal,host=redacted metrics_gathered=1i 1503522524000000000
> internal_agent,host=redacted metrics_dropped=0i,metrics_gathered=15i,gather_errors=0i,metrics_written=0i 1503522524000000000
> internal_write,output=influxdb,host=redacted buffer_limit=10000i,write_time_ns=0i,metrics_written=0i,metrics_filtered=0i,buffer_size=0i 1503522524000000000
> internal_gather,input=processes,host=redacted metrics_gathered=1i 1503522524000000000
> internal_gather,host=redacted,input=swap metrics_gathered=2i 1503522524000000000
> internal_gather,input=system,host=redacted metrics_gathered=2i 1503522524000000000
> internal_gather,input=disk,host=redacted metrics_gathered=0i 1503522524000000000
> internal_gather,input=diskio,host=redacted metrics_gathered=0i 1503522524000000000
> internal_gather,host=redacted,input=mongodb metrics_gathered=4i 1503522524000000000
> internal_gather,input=cpu,host=redacted metrics_gathered=3i 1503522524000000000
> internal_gather,input=kernel,host=redacted metrics_gathered=1i 1503522524000000000
> internal_gather,input=mem,host=redacted metrics_gathered=1i 1503522524000000000
> internal_gather,input=redis,host=redacted metrics_gathered=0i 1503522524000000000
* Plugin: inputs.redis, Collection 1
> redis_keyspace,host=redacted,server=localhost,port=6379,replication_role=master,database=db0 keys=91i,expires=88i,avg_ttl=57713995i 1503522524000000000
> redis,server=localhost,port=6379,replication_role=master,host=redacted uptime=1227401i,rdb_current_bgsave_time_sec=-1i,rdb_last_save_time=1502295122i,sync_full=0i,keyspace_misses=2271984i,loading=0i,rdb_changes_since_last_save=4549087i,aof_pending_rewrite=0i,used_cpu_user=722.27,aof_rewrite_buffer_length=0i,aof_pending_bio_fsync=0i,evicted_keys=0i,keyspace_hits=2270404i,master_repl_offset=0i,maxmemory_policy="noeviction",aof_enabled=1i,aof_rewrite_in_progress=0i,sync_partial_ok=0i,keyspace_hitrate=0.49982608266841144,used_memory=827424i,rdb_last_bgsave_time_sec=-1i,aof_current_size=41783033i,expired_keys=3601i,client_biggest_input_buf=0i,mem_fragmentation_ratio=6.35,rejected_connections=0i,repl_backlog_active=0i,used_memory_peak=1650944i,aof_rewrite_scheduled=0i,pubsub_channels=1i,repl_backlog_size=1048576i,aof_current_rewrite_time_sec=-1i,instantaneous_output_kbps=0.03,connected_slaves=0i,repl_backlog_histlen=0i,used_cpu_sys=1093.44,client_longest_output_list=0i,aof_buffer_length=0i,total_commands_processed=11419912i,repl_backlog_first_byte_offset=0i,rdb_bgsave_in_progress=0i,aof_last_write_status="ok",aof_delayed_fsync=0i,total_net_output_bytes=156369890i,sync_partial_err=0i,clients=14i,used_memory_rss=5255168i,used_cpu_user_children=0,total_connections_received=51764i,instantaneous_ops_per_sec=12i,instantaneous_input_kbps=0.83,used_memory_lua=39936i,maxmemory=0i,rdb_last_bgsave_status="ok",pubsub_patterns=0i,latest_fork_usec=204i,blocked_clients=9i,total_system_memory=4140580864i,aof_last_bgrewrite_status="ok",aof_base_size=21516i,migrate_cached_sockets=0i,used_cpu_sys_children=0,lru_clock=10350299i,rdb_last_save_time_elapsed=1227401i,aof_last_rewrite_time_sec=0i,total_net_input_bytes=770295625i,cluster_enabled=0i 1503522524000000000
* Plugin: inputs.disk, Collection 1
> disk,device=root,fstype=ext4,host=redacted,path=/ used=2692468736i,used_percent=5.649412439311064,inodes_total=2954880i,inodes_free=2870082i,inodes_used=84798i,total=50226184192i,free=44966801408i 1503522524000000000
* Plugin: inputs.diskio, Collection 1
> diskio,name=sdb,host=redacted read_bytes=2256896i,write_bytes=0i,read_time=63i,write_time=0i,io_time=50i,writes=0i,iops_in_progress=0i,reads=88i 1503522524000000000
> diskio,name=sda,host=redacted write_time=1548376i,io_time=1486360i,reads=12704i,read_bytes=354658304i,write_bytes=96178057216i,read_time=18120i,writes=4443201i,iops_in_progress=0i 1503522524000000000

danielnelson (Contributor) commented:

On the failing host, can you try to isolate the error to a single plugin?
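
One way to do that, assuming the standard CLI flags, is to run the agent with a single input enabled at a time, for example

telegraf --config telegraf.conf --input-filter mongodb

and watch whether the 400 errors still show up.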

danielnelson (Contributor) commented:

It is possible that #3142 is related.

danielnelson (Contributor) commented:

@sebastianarena #3142 is fixed. I think it is a long shot, but can you test with 1.4.0-rc2 to see if there is any improvement?

danielnelson added the area/influxdb and bug labels Aug 23, 2017
sebastianarena (Author) commented:

Hello all.
Here's what I'm trying to do at the moment.

Regarding the host that causes trouble: it was all of them. I have centralized all our system logs, and I can clearly see that every server got an HTTP 400 from InfluxDB at some point, which leads to the invalid measurements.

Regarding the exact plugin causing the issue: I only have the basic inputs, plus Mongo and Redis. I suspected those might be the trouble, but decided to try something else instead.

I shut down Telegraf on all the remote servers and left just the TICK stack on one server, plus one more server remotely pushing measurements. So far, an hour later, everything looks OK.

That seems to indicate the problem is not in Telegraf but in InfluxDB itself, which gets overloaded when multiple servers send their data at the same time. That worries me even more.

I'll keep this test running for the time being and add hosts one at a time until it breaks again.
Any ideas? Maybe switch to UDP instead of HTTP?

danielnelson (Contributor) commented:

UDP should be good for performance but is obviously less reliable. In 1.4 we are also adding gzip support, which should be an improvement if you are sending over a slow connection.
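
For reference, a rough sketch of what a UDP output would look like (the server name, port, and database are purely illustrative, and InfluxDB's UDP listener has to be enabled on the other end):

# telegraf.conf
[[outputs.influxdb]]
  urls = ["udp://influx.example.com:8089"]
  database = "telegraf"

# influxdb.conf
[[udp]]
  enabled = true
  bind-address = ":8089"
  database = "telegraf"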

From what I have read, nginx only does HTTP/2 on the client side while the proxy side is HTTP/1.1, so it seems to me that this must be a bug in either Telegraf or nginx, since I haven't had any reports of this happening in HTTP/1.1 mode.

sebastianarena (Author) commented:

I don't think nginx is involved, because with a single host everything works fine.
The problem starts when I add more hosts and more measurements.

In the end, I spent the last couple of hours playing with the interval, flush, and jitter settings. I slowed down the intake of inputs, and now everything seems to be working fine.
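
Roughly the kind of [agent] settings I mean (the values here are illustrative, not my exact config):

[agent]
  interval = "30s"           # collect less often than the 10s default
  collection_jitter = "5s"   # stagger collection across inputs
  flush_interval = "30s"
  flush_jitter = "10s"       # stagger writes so hosts don't all flush at once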

Something is wrong somewhere, and possibly gzip would be an awesome addition to speed things up.

For now, I think I'm ok.
THANKS

danielnelson (Contributor) commented:

Let's merge this issue with #2854.
