Sending _stub_aggregator metrics to carbon cache #133
Steps to reproduce:
Immediately after, invoke:
If I were to remove these lines:
Then the problem goes away.
Right, that's due to issue #32. You'd best restart the relay instead of reloading it when you use aggregates.
For issues #32 and #133, get the aggregator to expire as if it were shutting down on a reload of the configuration. A side effect of this is that we need to suspend the dispatchers so that we don't get data while we're clearing out the aggregators. In this scheme, a reload causes a slight service interruption: new connections are still accepted, but no data is read while the aggregations are flushed and the dispatchers load the new configuration. The upside is that the reload is atomic, that is, everything is routed according to either the old configuration or the new configuration, not both at the same time as was the case before this commit.
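A minimal sketch of the reload ordering described in that commit, assuming invented function names (`dispatch_hold`, `aggregator_expire_all`, `router_readconfig`, `dispatch_resume` are placeholders for illustration, not the relay's actual API):

```c
/* Hypothetical sketch of the reload ordering described above; the function
 * names here are made up for illustration and are not the relay's real API. */
#include <stdio.h>

static void dispatch_hold(void)         { puts("dispatchers: stop reading data"); }
static void aggregator_expire_all(void) { puts("aggregator: flush buckets as if shutting down"); }
static void router_readconfig(const char *p) { printf("router: load %s\n", p); }
static void dispatch_resume(void)       { puts("dispatchers: resume with new routes"); }

/* Everything flushed before the swap travels under the old configuration,
 * everything read after the resume travels under the new one, so the
 * reload is atomic from the metrics' point of view. */
static void reload(const char *config)
{
    dispatch_hold();           /* connections still accepted, no reads  */
    aggregator_expire_all();   /* drain aggregates under the old routes */
    router_readconfig(config); /* swap in the new routing table         */
    dispatch_resume();         /* continue serving normally             */
}

int main(void)
{
    reload("/etc/carbon-c-relay.conf");
    return 0;
}
```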
This should be fixed now. Tests welcome.
I can still reproduce it, though it looks like a bit of a racy circumstance? (We receive about 2-3 million metrics a minute, 300-400k of which are processed for aggregation, so everything is a race :-).) Here's the log as it happens (I just put in some printfs so I could better understand what is happening).
All I can conclude is that either all aggregates should be expired before reloading the config, or some other kind of identifier should be used that is persistent across reloads.
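For context, the `_stub_aggregator_0x169a400__` name from the report looks like it embeds a heap address of the aggregator, which would explain why it cannot survive a reload that recreates the aggregators. A minimal sketch of that assumption (the struct and `stub_prefix()` helper are invented for this example; only the `_stub_aggregator_0x..._` shape comes from the report):

```c
/* Sketch of why a pointer-derived stub prefix is not stable across reloads.
 * The struct and stub_prefix() helper are assumptions made for illustration;
 * only the "_stub_aggregator_0x..._" shape is taken from the report. */
#include <stdio.h>
#include <stdlib.h>

struct aggregator { int buckets; };

/* build the internal prefix used to route an aggregate back to its owner */
static void stub_prefix(char *buf, size_t len, struct aggregator *a)
{
    snprintf(buf, len, "_stub_aggregator_%p__", (void *)a);
}

int main(void)
{
    char before[64], after[64];

    struct aggregator *oldagg = malloc(sizeof(*oldagg));
    stub_prefix(before, sizeof(before), oldagg);

    /* a reload builds fresh aggregators and drops the old ones, so any
     * sample still queued with the old prefix no longer matches the new
     * aggregator's strip rule and leaks out to the backend as-is */
    struct aggregator *newagg = malloc(sizeof(*newagg));
    stub_prefix(after, sizeof(after), newagg);

    printf("old prefix: %s\nnew prefix: %s\n", before, after);
    free(oldagg);
    free(newagg);
    return 0;
}
```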
There is a thinko: we need to avoid new metrics coming in, but the aggregator behaves as a regular client to the relay, so its metrics aren't processed either.
As continuation for fixing issue #133, add substantial amount of code to be able to achieve sending expired aggregations when reloading the config. We need this to make sure that aggregates that are sent as part of the shutdown are routed with the configuration they were created for. This is in particular necessary for metrics that hold pseudo stubs to perform target routing. So now we hold all dispatchers, then take the first and make it run the internal_submission server specifically (where aggregations are being written to) and drain that queue. After that we reload the configuration for all dispatchers and continue serving as normal.
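A rough sketch of that drain-then-reload sequence, with invented names (the dispatcher struct and all functions below are placeholders; only the ordering mirrors the described fix):

```c
/* Rough sketch of the drain-then-reload sequence; the dispatcher struct and
 * all function names are placeholders for illustration, only the ordering
 * mirrors the fix described above. */
#include <stdio.h>
#include <stddef.h>

struct dispatcher { int id; };

static void hold_all(struct dispatcher *d, size_t n)
{ (void)d; printf("holding %zu dispatchers, no client data is read\n", n); }

static void drain_internal_submission(struct dispatcher *d)
{ printf("dispatcher %d: serve only internal_submission and drain it\n", d->id); }

static void reload_all(struct dispatcher *d, size_t n, const char *cfg)
{ (void)d; printf("reloading %zu dispatchers from %s\n", n, cfg); }

static void release_all(struct dispatcher *d, size_t n)
{ (void)d; printf("releasing %zu dispatchers\n", n); }

int main(void)
{
    struct dispatcher workers[4] = { {0}, {1}, {2}, {3} };
    size_t n = sizeof(workers) / sizeof(workers[0]);

    hold_all(workers, n);                   /* suspend normal input             */
    drain_internal_submission(&workers[0]); /* expired aggregates are routed    */
                                            /* with the config they were built  */
                                            /* for, before anything changes     */
    reload_all(workers, n, "/etc/carbon-c-relay.conf");
    release_all(workers, n);                /* continue serving as normal       */
    return 0;
}
```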
I believe I finally cracked this problem.
Yes, I can't seem to reproduce in the simple case anymore. 👍
That's good news!
Since updating to v1.2, I'm seeing carbon-c-relay sending e.g.:
_stub_aggregator_0x169a400__
to carbon after reloading the service (via upstart).