Invalid measurements found on Chronograf hours after setting up #3157

Closed
sebastianarena opened this issue Aug 23, 2017 · 11 comments
Labels: area/influxdb, bug (unexpected problem or unintended behavior)

sebastianarena commented Aug 23, 2017

Hello team.

I set up a TICK stack a little over a week ago, using a single server to track everything and to play with the platform. I loved it, so I decided to install Telegraf on each of the servers I manage. Each server was configured to collect the basic Telegraf metrics plus the mongodb and redis inputs.

After leaving that running over the weekend, I came back to a lot of invalid measurements in Chronograf.

[Screenshots: 2017-08-22 at 23:46:08, 21:04:39, and 23:51:50, showing the invalid measurements in Chronograf]

This is clearly 127.0.0.1:27017 reporting from the mongodb input, cut up into pieces for some reason.

I wiped the telegraf database and started from scratch.
After several hours, the problem happened again.

From what I can tell, measurements are being "cut off" at some point, and from then on a lot of invalid data is generated that InfluxDB cannot parse correctly but somehow stores anyway.

I found log entries stating that the data could not be parsed correctly; they show truncated data of some sort.

Is there anything I should change in the configuration to prevent this from happening, or is this a bug?

Thanks!

2017-08-23T02:01:40Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: unable to parse 'objects=1185290i,indexes=51i,type="db_stat",data_size=616619670i,index_size=87994368i,collections=14i,avg_obj_size=520.2268390014258,ok=1i,storage_size=207269888i,num_extents=0i 1503453700000000000': invalid field format dropped=0]
2017-08-23T02:03:20Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: unable to parse '29224960i,indexes=37i,ok=1i,index_size=24461312i 1503453800000000000': invalid field format dropped=0]
2017-08-23T02:03:50Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: unable to parse 'e_size=29224960i,num_extents=0i,index_size=24461312i,collections=9i,indexes=37i,type="db_stat" 1503453830000000000': invalid field format dropped=0]
2017-08-23T02:04:30Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: unable to parse '=520.2268702174151,storage_size=207269888i,num_extents=0i,type="db_stat",data_size=616619707i,indexes=51i,ok=1i 1503453870000000000': invalid field format dropped=0]
2017-08-23T02:09:50Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: unable to parse 'orage_size=29224960i,num_extents=0i,index_size=24461312i,collections=9i,ok=1i 1503454190000000000': invalid field format dropped=0]
2017-08-23T02:10:20Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: unable to parse 'tat" 1503454220000000000': invalid field format dropped=0]
2017-08-23T02:13:10Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: unable to parse 'ize=36864i,index_size=229376i,objects=2i,ok=1i 1503454390000000000': invalid field format dropped=0]
2017-08-23T02:16:10Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: unable to parse '87733908,num_extents=0i,data_size=54886210i,storage_size=29224960i,indexes=37i,ok=1i,index_size=24461312i,collections=9i,objects=201254i 1503454570000000000': invalid field format dropped=0]
goller (Contributor) commented Aug 23, 2017

Hi @sebastianarena, thanks for writing in!

Yikes, there must be some sort of line protocol issue when Telegraf sends data to InfluxDB.
Do you mind sending the telegraf.conf?

Also, I'd like to see a bit of the output from the Telegraf plugins. To get some data out, would you run this two or three times and send the output:

telegraf --config telegraf.conf -test

danielnelson (Contributor) commented:

There is a bug that looks similar to this which has not yet been isolated. It seems to occur when using HTTP/2 and nginx: #2854

Any chance you are also using HTTP/2 and nginx?
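
If you are, disabling HTTP/2 is usually just a matter of dropping the http2 flag from nginx's listen directive. A rough sketch of a typical reverse-proxy server block (the server name and ports are purely illustrative):

server {
    # listen 443 ssl http2;            # HTTP/2 enabled
    listen 443 ssl;                    # HTTP/2 disabled while testing
    server_name influx.example.com;

    location / {
        proxy_pass http://127.0.0.1:8086;  # InfluxDB's HTTP listener
        proxy_http_version 1.1;            # upstream side stays HTTP/1.1
    }
}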

sebastianarena (Author) commented:

Hi! YES! I'm using NGINX with HTTP2 enabled by default!
Let me try removing HTTP2 and check whether it solves the problem.

sebastianarena (Author) commented:

Just did a couple of tests.
It seems to have helped, but not all the way.
At least once a minute, one server reports an HTTP 400 error, and each time an invalid measurement is created.
For the record:

  • I'm still using NGINX to proxy port 80 to the internal port
  • I'm using HTTPS on NGINX
  • I'm NOT using HTTP2 anymore, per @danielnelson's suggestion

Maybe some part of this setup is still problematic.

Have a look:

2017-08-23T21:03:30Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: unable to parse '39287i,interrupts=341323616i,context_switches=560217853i,boot_time=1502293697i 1503522210000000000': invalid field format dropped=0]

[Screenshot: 2017-08-23 at 18:05:32, showing the resulting invalid measurement]

Here's my telegraf.conf:

[global_tags]
[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  debug = false
  quiet = false
  logfile = ""
  hostname = ""
  omit_hostname = false
[[outputs.influxdb]]
  urls = ["<redacted>"] # required
  database = "telegraf" # required
  retention_policy = ""
  write_consistency = "any"
  timeout = "5s"
  username = "<redacted>"
  password = "<redacted>"
[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "devfs"]
[[inputs.diskio]]
[[inputs.kernel]]
[[inputs.mem]]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]
[[inputs.internal]]
   collect_memstats = true
[[inputs.mongodb]]
   servers = ["127.0.0.1:27017"]
   gather_perdb_stats = true
[[inputs.redis]]
   servers = ["tcp://localhost:6379"]

Here's the sample output @goller asked for, specifically from the server that shows the most problems in the logs:

2017/08/23 21:08:42 I! Using config file: /etc/telegraf/telegraf.conf
* Plugin: inputs.mongodb, Collection 1
2017-08-23T21:08:42Z W! [inputs.mongodb] Using "mongodb://127.0.0.1:27017" as connection URL; please update your configuration to use an URL
* Plugin: inputs.mongodb, Collection 2
> mongodb,hostname=127.0.0.1:27017,host=redacted net_out_bytes=24025i,deletes_per_sec=0i,vsize_megabytes=1119i,percent_cache_used=5.3,getmores_per_sec=0i,active_writes=0i,queued_reads=0i,ttl_deletes_per_sec=0i,active_reads=0i,commands_per_sec=7i,updates_per_sec=0i,ttl_passes_per_sec=0i,resident_megabytes=193i,flushes_per_sec=0i,queries_per_sec=0i,queued_writes=0i,open_connections=8i,inserts_per_sec=0i,jumbo_chunks=0i,percent_cache_dirty=0,net_in_bytes=472i 1503522523000000000
> mongodb_db_stats,host=redacted,db_name=admin,hostname=127.0.0.1:27017 ok=1i,objects=1i,type="db_stat",num_extents=0i,avg_obj_size=59,data_size=59i,indexes=2i,index_size=32768i,storage_size=16384i,collections=1i 1503522523000000000
> mongodb_db_stats,hostname=127.0.0.1:27017,host=redacted,db_name=redacted collections=7i,objects=3295i,data_size=29828269i,num_extents=0i,ok=1i,index_size=569344i,type="db_stat",avg_obj_size=9052.585432473445,storage_size=20119552i,indexes=28i 1503522523000000000
> mongodb_db_stats,hostname=127.0.0.1:27017,host=redacted,db_name=local type="db_stat",collections=1i,storage_size=16384i,ok=1i,index_size=16384i,objects=1i,avg_obj_size=1657,data_size=1657i,num_extents=0i,indexes=1i 1503522523000000000
* Plugin: inputs.cpu, Collection 1
* Plugin: inputs.cpu, Collection 2
> cpu,host=redacted,cpu=cpu0 usage_softirq=0,usage_user=0,usage_system=0,usage_nice=0,usage_iowait=0,usage_irq=0,usage_idle=100,usage_steal=0,usage_guest=0,usage_guest_nice=0 1503522524000000000
> cpu,cpu=cpu1,host=redacted usage_irq=0,usage_guest_nice=0,usage_nice=0,usage_system=0,usage_idle=100,usage_iowait=0,usage_softirq=0,usage_steal=0,usage_guest=0,usage_user=0 1503522524000000000
> cpu,cpu=cpu-total,host=redacted usage_system=0,usage_idle=100,usage_irq=0,usage_steal=0,usage_guest=0,usage_user=0,usage_nice=0,usage_iowait=0,usage_softirq=0,usage_guest_nice=0 1503522524000000000
* Plugin: inputs.kernel, Collection 1
> kernel,host=redacted interrupts=341422671i,context_switches=560370475i,boot_time=1502293697i,processes_forked=1039626i 1503522524000000000
* Plugin: inputs.mem, Collection 1
> mem,host=redacted buffered=83603456i,inactive=1447612416i,used_percent=31.371947721004585,available=2841600000i,free=704352256i,cached=2127491072i,available_percent=68.62805227899541,total=4140580864i,used=1298980864i,active=1696927744i 1503522524000000000
* Plugin: inputs.processes, Collection 1
> processes,host=redacted sleeping=260i,total=260i,unknown=0i,total_threads=540i,running=0i,zombies=0i,stopped=0i,dead=0i,paging=0i,blocked=0i 1503522524000000000
* Plugin: inputs.swap, Collection 1
> swap,host=redacted used=0i,free=536866816i,used_percent=0,total=536866816i 1503522524000000000
> swap,host=redacted in=0i,out=0i 1503522524000000000
* Plugin: inputs.system, Collection 1
> system,host=redacted load1=0.01,load5=0.04,load15=0.05,n_users=1i,n_cpus=2i 1503522524000000000
> system,host=redacted uptime=1228826i,uptime_format="14 days,  5:20" 1503522524000000000
* Plugin: inputs.internal, Collection 1
> internal_memstats,host=redacted sys_bytes=12232952i,mallocs=57067i,heap_sys_bytes=9011200i,alloc_bytes=3825240i,heap_objects=13631i,num_gc=3i,heap_alloc_bytes=3825240i,heap_idle_bytes=3883008i,heap_in_use_bytes=5128192i,heap_released_bytes=0i,frees=43436i,pointer_lookups=1182i,total_alloc_bytes=9563888i 1503522524000000000
> internal_gather,input=internal,host=redacted metrics_gathered=1i 1503522524000000000
> internal_agent,host=redacted metrics_dropped=0i,metrics_gathered=15i,gather_errors=0i,metrics_written=0i 1503522524000000000
> internal_write,output=influxdb,host=redacted buffer_limit=10000i,write_time_ns=0i,metrics_written=0i,metrics_filtered=0i,buffer_size=0i 1503522524000000000
> internal_gather,input=processes,host=redacted metrics_gathered=1i 1503522524000000000
> internal_gather,host=redacted,input=swap metrics_gathered=2i 1503522524000000000
> internal_gather,input=system,host=redacted metrics_gathered=2i 1503522524000000000
> internal_gather,input=disk,host=redacted metrics_gathered=0i 1503522524000000000
> internal_gather,input=diskio,host=redacted metrics_gathered=0i 1503522524000000000
> internal_gather,host=redacted,input=mongodb metrics_gathered=4i 1503522524000000000
> internal_gather,input=cpu,host=redacted metrics_gathered=3i 1503522524000000000
> internal_gather,input=kernel,host=redacted metrics_gathered=1i 1503522524000000000
> internal_gather,input=mem,host=redacted metrics_gathered=1i 1503522524000000000
> internal_gather,input=redis,host=redacted metrics_gathered=0i 1503522524000000000
* Plugin: inputs.redis, Collection 1
> redis_keyspace,host=redacted,server=localhost,port=6379,replication_role=master,database=db0 keys=91i,expires=88i,avg_ttl=57713995i 1503522524000000000
> redis,server=localhost,port=6379,replication_role=master,host=redacted uptime=1227401i,rdb_current_bgsave_time_sec=-1i,rdb_last_save_time=1502295122i,sync_full=0i,keyspace_misses=2271984i,loading=0i,rdb_changes_since_last_save=4549087i,aof_pending_rewrite=0i,used_cpu_user=722.27,aof_rewrite_buffer_length=0i,aof_pending_bio_fsync=0i,evicted_keys=0i,keyspace_hits=2270404i,master_repl_offset=0i,maxmemory_policy="noeviction",aof_enabled=1i,aof_rewrite_in_progress=0i,sync_partial_ok=0i,keyspace_hitrate=0.49982608266841144,used_memory=827424i,rdb_last_bgsave_time_sec=-1i,aof_current_size=41783033i,expired_keys=3601i,client_biggest_input_buf=0i,mem_fragmentation_ratio=6.35,rejected_connections=0i,repl_backlog_active=0i,used_memory_peak=1650944i,aof_rewrite_scheduled=0i,pubsub_channels=1i,repl_backlog_size=1048576i,aof_current_rewrite_time_sec=-1i,instantaneous_output_kbps=0.03,connected_slaves=0i,repl_backlog_histlen=0i,used_cpu_sys=1093.44,client_longest_output_list=0i,aof_buffer_length=0i,total_commands_processed=11419912i,repl_backlog_first_byte_offset=0i,rdb_bgsave_in_progress=0i,aof_last_write_status="ok",aof_delayed_fsync=0i,total_net_output_bytes=156369890i,sync_partial_err=0i,clients=14i,used_memory_rss=5255168i,used_cpu_user_children=0,total_connections_received=51764i,instantaneous_ops_per_sec=12i,instantaneous_input_kbps=0.83,used_memory_lua=39936i,maxmemory=0i,rdb_last_bgsave_status="ok",pubsub_patterns=0i,latest_fork_usec=204i,blocked_clients=9i,total_system_memory=4140580864i,aof_last_bgrewrite_status="ok",aof_base_size=21516i,migrate_cached_sockets=0i,used_cpu_sys_children=0,lru_clock=10350299i,rdb_last_save_time_elapsed=1227401i,aof_last_rewrite_time_sec=0i,total_net_input_bytes=770295625i,cluster_enabled=0i 1503522524000000000
* Plugin: inputs.disk, Collection 1
> disk,device=root,fstype=ext4,host=redacted,path=/ used=2692468736i,used_percent=5.649412439311064,inodes_total=2954880i,inodes_free=2870082i,inodes_used=84798i,total=50226184192i,free=44966801408i 1503522524000000000
* Plugin: inputs.diskio, Collection 1
> diskio,name=sdb,host=redacted read_bytes=2256896i,write_bytes=0i,read_time=63i,write_time=0i,io_time=50i,writes=0i,iops_in_progress=0i,reads=88i 1503522524000000000
> diskio,name=sda,host=redacted write_time=1548376i,io_time=1486360i,reads=12704i,read_bytes=354658304i,write_bytes=96178057216i,read_time=18120i,writes=4443201i,iops_in_progress=0i 1503522524000000000

danielnelson (Contributor) commented:

On the failing host, can you try to isolate the error to a single plugin?
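
One way to do that, assuming the standard CLI flags, is to run the agent with a single input enabled at a time, for example

telegraf --config telegraf.conf --input-filter mongodb

and watch whether the 400 errors still show up.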

danielnelson (Contributor) commented:

It is possible that #3142 is related.

danielnelson (Contributor) commented:

@sebastianarena #3142 is fixed. I think it is a long shot, but can you test with 1.4.0-rc2 to see if there is any improvement?

danielnelson added the area/influxdb and bug labels Aug 23, 2017
sebastianarena (Author) commented:

Hello all.
Here's what I'm trying to do at the moment.

Regarding the host that causes trouble: it was all of them. I have centralized all our system logs, and I can clearly see that every server got an HTTP 400 from InfluxDB at some point, which leads to the invalid measurements.

Regarding the exact plugin causing the issue: I only have the basic inputs, plus Mongo and Redis. I suspected those might be the trouble, but decided to try something else instead.

I shut down Telegraf on all the remote servers and left just the TICK stack on one server, plus one more server remotely pushing measurements. So far, an hour later, everything looks OK.

That seems to indicate the problem is not in Telegraf but in InfluxDB itself, which gets overloaded when multiple servers send their data at the same time. That worries me even more.

I'll keep this test running for the time being and add hosts one at a time until it breaks again.
Any ideas? Maybe switch to UDP instead of HTTP?

danielnelson (Contributor) commented:

UDP should be good for performance but is obviously less reliable. In 1.4 we are also adding gzip support, which should be an improvement if you are sending over a slow connection.
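
For reference, a rough sketch of what a UDP output would look like (the server name, port, and database are purely illustrative, and InfluxDB's UDP listener has to be enabled on the other end):

# telegraf.conf
[[outputs.influxdb]]
  urls = ["udp://influx.example.com:8089"]
  database = "telegraf"

# influxdb.conf
[[udp]]
  enabled = true
  bind-address = ":8089"
  database = "telegraf"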

From what I have read, nginx only does HTTP/2 on the client side while the proxy side is HTTP/1.1, so it seems to me that this must be a bug in either Telegraf or nginx, since I haven't had any reports of this happening in HTTP/1.1 mode.

sebastianarena (Author) commented:

I don't think nginx is involved, because with a single host everything works fine.
The problem starts when I add more hosts and more measurements.

In the end, I spent the last couple of hours playing with the interval, flush, and jitter settings. I slowed down the intake of inputs, and now everything seems to be working fine.
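
Roughly the kind of [agent] settings I mean (the values here are illustrative, not my exact config):

[agent]
  interval = "30s"           # collect less often than the 10s default
  collection_jitter = "5s"   # stagger collection across inputs
  flush_interval = "30s"
  flush_jitter = "10s"       # stagger writes so hosts don't all flush at once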

Something is wrong somewhere, and possibly gzip would be an awesome addition to speed things up.

For now, I think I'm ok.
THANKS

danielnelson (Contributor) commented:

Let's merge this issue with #2854.
