Sending _stub_aggregator metrics to carbon cache #133

Closed
iain-buclaw-sociomantic opened this issue Dec 23, 2015 · 9 comments
@iain-buclaw-sociomantic
Contributor

Since updating to v1.2, I'm seeing carbon-c-relay send metrics such as _stub_aggregator_0x169a400__ to carbon after reloading the service (via upstart).

@iain-buclaw-sociomantic
Contributor Author

Steps to reproduce.

  • routes.conf
cluster test
    file
        test.out
    ;

aggregate
    ^test\.
    every 10 seconds
    expire after 30 seconds
    compute sum write to
        aggregates.all.test
    send to test
    stop
    ;

match *
    send to test
    ;
  • command:
    ./relay -f routes.conf -s
  • input:
test.foo 0 1450887178
test.bar 1 1450887178
test.baz 2 1450887178

Immediately after, send SIGHUP (e.g. ``kill -1 `pidof relay` ``). A scripted version of these steps is sketched below, after the output.

  • output:
_stub_aggregator_0x18f09f0__aggregates.all.test 3.000000 1450887180
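
For convenience, the whole reproduction can be scripted. This is only a minimal sketch under a couple of assumptions: the relay is assumed to accept plain-text metrics on its default port 2003 (if the flags above make it read from stdin instead, the printf can be piped straight into ./relay), nc is assumed to be available, and the hard-coded timestamps from this report should be replaced with current ones when actually running it.

#!/bin/sh
# Minimal reproduction sketch; port 2003 and the nc flags are assumptions,
# adjust for your relay configuration and netcat variant.
./relay -f routes.conf -s &
relay_pid=$!
sleep 1    # give the relay a moment to start

# inject the three test metrics (use current Unix timestamps in practice)
printf 'test.foo 0 1450887178\ntest.bar 1 1450887178\ntest.baz 2 1450887178\n' \
    | nc -w 1 localhost 2003

# immediately trigger a config reload
kill -HUP "$relay_pid"

# after the 10-second aggregation interval, test.out should contain
# aggregates.all.test; the bug shows up as a _stub_aggregator_* line instead
sleep 15
grep _stub_aggregator test.out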

@iain-buclaw-sociomantic
Contributor Author

If I were to remove these lines:

    send to test
    stop

Then the problem goes away.

@grobian
Owner

grobian commented Dec 23, 2015

Right, that's due to issue #32. You'd best restart the relay instead of reloading it when you use aggregates.

@iain-buclaw-sociomantic changed the title from "Sending _stub_aggregator metrics written to carbon cache" to "Sending _stub_aggregator metrics to carbon cache" on Dec 24, 2015
grobian added a commit that referenced this issue Dec 25, 2015
For issues #32 and #133, get the aggregator to expire as if it was
shutting down on a reload of the configuration.  Side effect of this is
that we need to suspend the dispatchers such that we don't get data
while we're clearing out the aggregators.
In this scheme, a reload will cause a slight service interruption.
New connections are still accepted, but no data is read while the
aggregations are flushed, and the dispatchers load the new
configuration.  The upside of this is that there will be an atomic
reload, that is, everything is either routed according to the old
configuration, or the new configuration, not both at the same time as
was the case before this commit.
@grobian
Owner

grobian commented Dec 25, 2015

This should be fixed now. Tests welcome.
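
For anyone testing: assuming the scripted reproduction sketched earlier in this thread, a fixed build should leave no stub metrics in test.out after the reload, so a check along these lines should report nothing leaked:

# run the reproduction against the new build, then:
if grep -q _stub_aggregator test.out; then
    echo 'BUG: stub aggregator metrics leaked to the cluster'
else
    echo 'OK: only rewritten aggregate names in test.out'
fi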

@grobian closed this as completed Dec 25, 2015
@iain-buclaw-sociomantic
Contributor Author

I can still reproduce it, though it looks like a bit of a racy circumstance? (We receive about 2-3 million metrics a minute, 300-400k of which are processed for aggregation, so everything is a race. :-)

Here's the log as it happens (I just put in some printf's so I could better understand what is happening).

===
=== Reloading (^test\.(.*)) using route: _stub_aggregator_0xd6fc10__
===
>>> Aggregator received metric: test.foo 123 1451382897
<<< Expiring aggregate metric: _stub_aggregator_0xd6fc10__aggregates.max.test 123.000000 1451382900
>>> Enqueuing: _stub_aggregator_0xd6fc10__aggregates.max.test 123.000000 1451382900
<<< Dequeuing: _stub_aggregator_0xd6fc10__aggregates.max.test 123.000000 1451382900
<<< Writing to pipe(10): _stub_aggregator_0xd6fc10__aggregates.max.test 123.000000 1451382900
>>> Enqueuing: aggregates.max.test 123.000000 1451382900
<<< Dequeuing: aggregates.max.test 123.000000 1451382900
<<< Writing to file(11): aggregates.max.test 123.000000 1451382900
[2015-12-29 10:55:05] caught SIGHUP...
[2015-12-29 10:55:05] reloading config from 'routes.conf'
===
=== Reloading (^test\.(.*)) using route: _stub_aggregator_0xd2b9d0__
===
[2015-12-29 10:55:05] reloading collector
[2015-12-29 10:55:06] interrupting workers
[2015-12-29 10:55:06] expiring aggregations
<<< Expiring aggregate metric: _stub_aggregator_0xd6fc10__aggregates.sum.test 123.000000 1451382900
>>> Enqueuing: _stub_aggregator_0xd6fc10__aggregates.sum.test 123.000000 1451382900
[2015-12-29 10:55:06] reloading workers
[2015-12-29 10:55:06] SIGHUP handler complete
<<< Dequeuing: _stub_aggregator_0xd6fc10__aggregates.sum.test 123.000000 1451382900
<<< Writing to pipe(10): _stub_aggregator_0xd6fc10__aggregates.sum.test 123.000000 1451382900
>>> Enqueuing: _stub_aggregator_0xd6fc10__aggregates.sum.test 123.000000 1451382900
<<< Dequeuing: _stub_aggregator_0xd6fc10__aggregates.sum.test 123.000000 1451382900
<<< Writing to file(11): _stub_aggregator_0xd6fc10__aggregates.sum.test 123.000000 1451382900

All I can conclude is that either all aggregates should be expired before reloading the config, or some other kind of identifier should be used that is persistent across reloads.

@grobian
Owner

grobian commented Dec 29, 2015

There is a thinko: we need to avoid new metrics coming in, but the aggregator behaves as a regular client to the relay, so its metrics aren't processed either.

@grobian reopened this Dec 31, 2015
grobian added a commit that referenced this issue Jan 3, 2016
As continuation for fixing issue #133, add substantial amount of code to
be able to achieve sending expired aggregations when reloading the
config.  We need this to make sure that aggregates that are sent as part
of the shutdown are routed with the configuration they were created for.
This is in particular necessary for metrics that hold pseudo stubs to
perform target routing.

So now we hold all dispatchers, then take the first and make it run the
internal_submission server specifically (where aggregations are being
written to) and drain that queue.  After that we reload the
configuration for all dispatchers and continue serving as normal.
@grobian
Owner

grobian commented Jan 3, 2016

I believe I finally cracked this problem.

@grobian closed this as completed Jan 3, 2016
@iain-buclaw-sociomantic
Contributor Author

Yes, I can't seem to reproduce in the simple case anymore. 👍

@grobian
Owner

grobian commented Jan 5, 2016

That's good news!

pkittenis pushed a commit to pkittenis/carbon-c-relay that referenced this issue Feb 3, 2016
pkittenis pushed a commit to pkittenis/carbon-c-relay that referenced this issue Feb 3, 2016