Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

segfault when date format is incorrect #239

Closed
poblahblahblah opened this issue Dec 8, 2016 · 17 comments
Closed

segfault when date format is incorrect #239

poblahblahblah opened this issue Dec 8, 2016 · 17 comments

Comments

@poblahblahblah
Copy link

Hello,

I am currently testing out carbon-c-relay and had accidentally configured the metric emitter to send a higher resolution timestamp (milliseconds rather than seconds) than I meant to. This resulted in a segfault from carbon-c-relay.

This results in a segfault:

# Current time in milliseconds
now=$(($(date +%s%N)/1000000))
metric=test.metric
value=$(( $RANDOM % 100 ))
host=127.0.0.1

This is fine and works as expected:

now=$(date +'%s')
metric=test.metric
value=$(echo $(( $RANDOM % 10 )))
host=127.0.0.1

I am on CentOS7.2, using carbon-c-relay 2.3. I also tested this against HEAD in master and it caused a segfault there as well.

@grobian
Copy link
Owner

grobian commented Dec 9, 2016

I don't really get this syntax. Where do you put this and how does it relate to carbon-c-relay?

@poblahblahblah
Copy link
Author

@grobian I should have been more clear - those are shell scripts that act as "emitters" to send fake metrics to carbon-c-relay.

When you send the timestamp as time since unix epoch in milliseconds it causes carbon-c-relay to segfault. I am presuming that this is an integer overflow.

@piotr1212
Copy link
Contributor

piotr1212 commented Dec 9, 2016 via email

@poblahblahblah
Copy link
Author

carbon-c-relay config:

cluster kairos forward 192.168.168.11;

aggregate
    ^sys\.dc1\.somehost-[0-9]+\.somecluster\.mysql\.replication_delay
  every 30 seconds
  expire after 60 seconds
  compute average write to
    mysql.somecluster.average_replication_delay
  send to kairos
;

match *
  send to kairos
;

version: carbon-c-relay v2.3 (633ff6)

Here is an example of a emitter sending something good (using netcat as a "listener"):

sys.dc1.somehost-1.somecluster.mysql.replication_delay 6 1481310457

Here is an example of an emitter sending something that causes carbon-c-relay to crash:

sys.dc1.somehost-1.somecluster.mysql.replication_delay 3 1481310571477

So, yeah - sending the timestamp in milliseconds causes carbon-c-relay to segfault. I know clients out there are able to send in millisecond resolution even though it's not supported by carbon-c-relay, so it seems like a misconfigured client could pretty easily bring down a carbon-c-relay based setup.

@grobian
Copy link
Owner

grobian commented Dec 10, 2016

ok, this is an aggregator problem then, will have a look.

@grobian
Copy link
Owner

grobian commented Dec 10, 2016

I tried your example, it doesn't crash for me. I've checked the code, and it does bounds checking, so a crash on this is not expected either.

@grobian
Copy link
Owner

grobian commented Dec 10, 2016

A stacktrace would be very helpful here, if possible.

@grobian
Copy link
Owner

grobian commented Dec 28, 2016

I tried reproducing once again, but no matter what I try, no crack.

@grobian grobian closed this as completed Dec 28, 2016
@poblahblahblah
Copy link
Author

poblahblahblah commented Jan 5, 2017

sorry, @grobian was busy with work, then on vacation.

How would I go about getting a stacktrace? I can still trigger the crash locally on 2.3 and the recently released 2.4.

@grobian
Copy link
Owner

grobian commented Jan 5, 2017

Ok, is it possible for you to compile carbon-c-relay yourself? If so, use CFLAGS="-g". Then run the relay in gdb, like gdb --args ./relay . When it crashes, type thread apply all bt -- that should give the information I'm looking for.

@grobian grobian reopened this Jan 5, 2017
@poblahblahblah
Copy link
Author

poblahblahblah commented Jan 5, 2017

@grobian is this what you are looking for:

[vagrant@localhost carbon-c-relay-2.4]$ gdb --args relay -f conf/relay.conf
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/vagrant/carbon-c-relay-2.4/relay...done.

(gdb) run
Starting program: /home/vagrant/carbon-c-relay-2.4/relay -f conf/relay.conf
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[2017-01-05 18:28:29] starting carbon-c-relay v2.4 (2017-01-05), pid=4434
configuration:
    relay hostname = localhost.localdomain
    listen port = 2003
    workers = 1
    send batch size = 2500
    server queue size = 25000
    server max stalls = 4
    statistics submission interval = 60s
    listen backlog = 32
    server connection IO timeout = 600ms
    routes configuration = conf/relay.conf

parsed configuration follows:
cluster kairos
    forward
        localhost:4444
    ;

aggregate (.*)\.bidder\.(.*)\.blah\.(.*)
    every 30 seconds
    expire after 60 seconds
    timestamp at end of bucket
    compute average write to
        agg.platform.bidder.all.blah.\3
    compute average write to
        agg.\1.bidder.all.blah.\3
    send to kairos
    ;

[2017-01-05 18:28:29] listening on tcp4 0.0.0.0 port 2003
[2017-01-05 18:28:29] listening on tcp6 :: port 2003
[2017-01-05 18:28:29] listening on udp4 0.0.0.0 port 2003
[2017-01-05 18:28:29] listening on udp6 :: port 2003
[2017-01-05 18:28:29] listening on UNIX socket /tmp/.s.carbon-c-relay.2003
[New Thread 0x7ffff61d4700 (LWP 4438)]
[2017-01-05 18:28:29] starting 1 workers
[New Thread 0x7ffff59d3700 (LWP 4439)]
[New Thread 0x7ffff51d2700 (LWP 4440)]
[2017-01-05 18:28:29] starting aggregator
[New Thread 0x7ffff49d1700 (LWP 4441)]
[2017-01-05 18:28:29] starting statistics collector
[New Thread 0x7ffff41d0700 (LWP 4442)]
[2017-01-05 18:28:29] starting servers
[New Thread 0x7ffff39cf700 (LWP 4443)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff59d3700 (LWP 4439)]
aggregator_putmetric (s=0x6237f0, metric=metric@entry=0x7ffff61e3d0d "pob.bidder.pobtest001.blah.test 9 1483578550658\n",
    firstspace=firstspace@entry=0x7ffff61e3d5d " 9 1483578550658\n", nmatch=4, pmatch=pmatch@entry=0x7ffff59d0ad0) at aggregator.c:326
326			if (bucket->cnt == 0) {
Missing separate debuginfos, use: debuginfo-install glibc-2.17-106.el7_2.8.x86_64
(gdb) thread apply all bt

Thread 7 (Thread 0x7ffff39cf700 (LWP 4443)):
#0  0x00007ffff75b8efd in nanosleep () from /lib64/libc.so.6
#1  0x00007ffff75e9b34 in usleep () from /lib64/libc.so.6
#2  0x000000000040f1af in server_queuereader (d=0x61b1b0) at server.c:119
#3  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007ffff75f1ced in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7ffff41d0700 (LWP 4442)):
#0  0x00007ffff75b8efd in nanosleep () from /lib64/libc.so.6
#1  0x00007ffff75b8d94 in sleep () from /lib64/libc.so.6
#2  0x0000000000410569 in collector_runner (s=<optimized out>) at collector.c:140
#3  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007ffff75f1ced in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7ffff49d1700 (LWP 4441)):
#0  0x00007ffff75b8efd in nanosleep () from /lib64/libc.so.6
#1  0x00007ffff75e9b34 in usleep () from /lib64/libc.so.6
#2  0x0000000000412168 in aggregator_expire (sub=0x6237f0) at aggregator.c:568
#3  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007ffff75f1ced in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7ffff51d2700 (LWP 4440)):
#0  0x00007ffff75b8efd in nanosleep () from /lib64/libc.so.6
#1  0x00007ffff75e9b34 in usleep () from /lib64/libc.so.6
#2  0x000000000040f1af in server_queuereader (d=0x621520) at server.c:119
#3  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007ffff75f1ced in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7ffff59d3700 (LWP 4439)):
#0  aggregator_putmetric (s=0x6237f0, metric=metric@entry=0x7ffff61e3d0d "pob.bidder.pobtest001.blah.test 9 1483578550658\n",
    firstspace=firstspace@entry=0x7ffff61e3d5d " 9 1483578550658\n", nmatch=4, pmatch=pmatch@entry=0x7ffff59d0ad0) at aggregator.c:326
#1  0x00000000004079a5 in router_route_intern (blackholed=blackholed@entry=0x7ffff59d2d27 "", ret=ret@entry=0x7ffff61e5d10, curlen=curlen@entry=0x7ffff59d2d28, retsize=retsize@entry=64,
    srcaddr=srcaddr@entry=0x7ffff61e1ced "192.168.168.11", metric=metric@entry=0x7ffff61e3d0d "pob.bidder.pobtest001.blah.test 9 1483578550658\n",
    firstspace=firstspace@entry=0x7ffff61e3d5d " 9 1483578550658\n", r=0x7ffff7fac298) at router.c:3073
#2  0x000000000040eb27 in router_route (rtr=<optimized out>, ret=ret@entry=0x7ffff61e5d10, retcnt=retcnt@entry=0x7ffff61e6110, retsize=retsize@entry=64, srcaddr=srcaddr@entry=0x7ffff61e1ced "192.168.168.11",
    metric=metric@entry=0x7ffff61e3d0d "pob.bidder.pobtest001.blah.test 9 1483578550658\n", firstspace=firstspace@entry=0x7ffff61e3d5d " 9 1483578550658\n") at router.c:3180
#3  0x0000000000406819 in dispatch_connection (start=..., self=0x621260, conn=0x7ffff61e1ce8) at dispatcher.c:447
#4  dispatch_runner (arg=0x621260) at dispatcher.c:628
#5  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007ffff75f1ced in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7ffff61d4700 (LWP 4438)):
#0  0x00007ffff75e769d in poll () from /lib64/libc.so.6
#1  0x0000000000406438 in dispatch_runner (arg=0x623840) at dispatcher.c:578
#2  0x00007ffff78c4dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007ffff75f1ced in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7ffff7fed740 (LWP 4434)):
#0  0x00007ffff75b8efd in nanosleep () from /lib64/libc.so.6
#1  0x00007ffff75b8d94 in sleep () from /lib64/libc.so.6
#2  0x0000000000402e26 in main (argc=<optimized out>, argv=<optimized out>) at relay.c:827

@grobian
Copy link
Owner

grobian commented Jan 5, 2017

YES! Thank you!!!

grobian added a commit that referenced this issue Jan 5, 2017
This should be the fix for issue #239.  The bucket position was
calculated in a temp variable which was not big enough to hold the
result.  Likely the truncated value fitted in the position list, but
the bucket wasn't initialised yet.
@grobian
Copy link
Owner

grobian commented Jan 5, 2017

Can you try if my latest commit changes anything for you? After pushing I got a bit sceptical whether or not I found the issue.

@poblahblahblah
Copy link
Author

@grobian Everything looks good on the latest commit. Here is the netcat output that carbon-c-relay is relaying to:

pob.bidder.pobtest001.blah.test 9 1483578450
pob.bidder.pobtest001.blah.test 9 1483578550658

@poblahblahblah
Copy link
Author

@grobian also, thanks a lot for the fix and for carbon-c-relay!

@grobian
Copy link
Owner

grobian commented Jan 6, 2017

So, are you happy with this behaviour? (It should drop the metric with a message about being too much in the future or something) If this is ok for you, I'll close this and release 2.5 soon.

@poblahblahblah
Copy link
Author

Yeah, this behavior is fine with me.

@grobian grobian closed this as completed Jan 6, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants