Sending _stub_aggregator metrics to carbon cache #133
Steps to reproduce:
Immediately after, invoke:
If I were to remove these lines:
Then the problem goes away.
Right, that's due to issue #32. You'd best restart the relay instead of reloading it when you use aggregates.
For issues #32 and #133, get the aggregator to expire as if it were shutting down on a reload of the configuration. A side effect of this is that we need to suspend the dispatchers so that we don't get data while we're clearing out the aggregators. In this scheme, a reload causes a slight service interruption: new connections are still accepted, but no data is read while the aggregations are flushed and the dispatchers load the new configuration. The upside is that the reload is atomic, that is, everything is routed according to either the old configuration or the new configuration, not both at the same time as was the case before this commit.
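A minimal sketch of the reload ordering described in that commit, assuming invented function names (`dispatch_hold`, `aggregator_expire_all`, `router_readconfig`, `dispatch_resume` are placeholders for illustration, not the relay's actual API):

```c
/* Hypothetical sketch of the reload ordering described above; the function
 * names here are made up for illustration and are not the relay's real API. */
#include <stdio.h>

static void dispatch_hold(void)         { puts("dispatchers: stop reading data"); }
static void aggregator_expire_all(void) { puts("aggregator: flush buckets as if shutting down"); }
static void router_readconfig(const char *p) { printf("router: load %s\n", p); }
static void dispatch_resume(void)       { puts("dispatchers: resume with new routes"); }

/* Everything flushed before the swap travels under the old configuration,
 * everything read after the resume travels under the new one, so the
 * reload is atomic from the metrics' point of view. */
static void reload(const char *config)
{
    dispatch_hold();           /* connections still accepted, no reads  */
    aggregator_expire_all();   /* drain aggregates under the old routes */
    router_readconfig(config); /* swap in the new routing table         */
    dispatch_resume();         /* continue serving normally             */
}

int main(void)
{
    reload("/etc/carbon-c-relay.conf");
    return 0;
}
```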
This should be fixed now. Tests welcome.
I can still reproduce it, though it looks like a bit of a racy circumstance? (We receive about 2-3 million metrics a minute, 300-400k of which are processed for aggregation, so everything is a race :-).) Here's the log as it happens (I just put in some printfs so I could better understand what is happening).
All I can conclude is that either all aggregates should be expired before reloading the config, or some other kind of identifier should be used that is persistent across reloads.
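For context, the `_stub_aggregator_0x169a400__` name from the report looks like it embeds a heap address of the aggregator, which would explain why it cannot survive a reload that recreates the aggregators. A minimal sketch of that assumption (the struct and `stub_prefix()` helper are invented for this example; only the `_stub_aggregator_0x..._` shape comes from the report):

```c
/* Sketch of why a pointer-derived stub prefix is not stable across reloads.
 * The struct and stub_prefix() helper are assumptions made for illustration;
 * only the "_stub_aggregator_0x..._" shape is taken from the report. */
#include <stdio.h>
#include <stdlib.h>

struct aggregator { int buckets; };

/* build the internal prefix used to route an aggregate back to its owner */
static void stub_prefix(char *buf, size_t len, struct aggregator *a)
{
    snprintf(buf, len, "_stub_aggregator_%p__", (void *)a);
}

int main(void)
{
    char before[64], after[64];

    struct aggregator *oldagg = malloc(sizeof(*oldagg));
    stub_prefix(before, sizeof(before), oldagg);

    /* a reload builds fresh aggregators and drops the old ones, so any
     * sample still queued with the old prefix no longer matches the new
     * aggregator's strip rule and leaks out to the backend as-is */
    struct aggregator *newagg = malloc(sizeof(*newagg));
    stub_prefix(after, sizeof(after), newagg);

    printf("old prefix: %s\nnew prefix: %s\n", before, after);
    free(oldagg);
    free(newagg);
    return 0;
}
```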
There is a thinko: we need to avoid new metrics coming in, but the aggregator behaves as a regular client to the relay, so its metrics aren't processed either.
As continuation for fixing issue #133, add substantial amount of code to be able to achieve sending expired aggregations when reloading the config. We need this to make sure that aggregates that are sent as part of the shutdown are routed with the configuration they were created for. This is in particular necessary for metrics that hold pseudo stubs to perform target routing. So now we hold all dispatchers, then take the first and make it run the internal_submission server specifically (where aggregations are being written to) and drain that queue. After that we reload the configuration for all dispatchers and continue serving as normal.
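A rough sketch of that drain-then-reload sequence, with invented names (the dispatcher struct and all functions below are placeholders; only the ordering mirrors the described fix):

```c
/* Rough sketch of the drain-then-reload sequence; the dispatcher struct and
 * all function names are placeholders for illustration, only the ordering
 * mirrors the fix described above. */
#include <stdio.h>
#include <stddef.h>

struct dispatcher { int id; };

static void hold_all(struct dispatcher *d, size_t n)
{ (void)d; printf("holding %zu dispatchers, no client data is read\n", n); }

static void drain_internal_submission(struct dispatcher *d)
{ printf("dispatcher %d: serve only internal_submission and drain it\n", d->id); }

static void reload_all(struct dispatcher *d, size_t n, const char *cfg)
{ (void)d; printf("reloading %zu dispatchers from %s\n", n, cfg); }

static void release_all(struct dispatcher *d, size_t n)
{ (void)d; printf("releasing %zu dispatchers\n", n); }

int main(void)
{
    struct dispatcher workers[4] = { {0}, {1}, {2}, {3} };
    size_t n = sizeof(workers) / sizeof(workers[0]);

    hold_all(workers, n);                   /* suspend normal input             */
    drain_internal_submission(&workers[0]); /* expired aggregates are routed    */
                                            /* with the config they were built  */
                                            /* for, before anything changes     */
    reload_all(workers, n, "/etc/carbon-c-relay.conf");
    release_all(workers, n);                /* continue serving as normal       */
    return 0;
}
```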
I believe I finally cracked this problem.
Yes, I can't seem to reproduce in the simple case anymore. 👍
That's good news!
Since updating to v1.2, I'm seeing carbon-c-relay sending e.g.:
_stub_aggregator_0x169a400__
to carbon after reloading the service (via upstart).