BandwidthCounter is erroneously reporting huge spikes #65
Hm. So, this could be due to how we record bandwidth internally. We accumulate the bytes transferred on each meter and, on every sweep "tick", divide the bytes seen since the last tick by the elapsed time to get an instantaneous rate, which is then folded into an exponentially weighted moving average.

Unfortunately, this means a very short tick can produce an enormous instantaneous rate: given a 1ns "tick" and a 10MiB message, we'll get a 9 exabyte spike.
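For illustration, here is a sketch of that kind of calculation, plugging in the 10MiB-over-1ns example. The function and the alpha value are made up for the sketch; the real code lives in go-flow-metrics' sweeper.go and differs in detail.

```go
package main

import (
	"fmt"
	"time"
)

// updateRate folds the bytes seen since the last tick into an EWMA rate.
// Simplified sketch only, not the actual go-flow-metrics code.
func updateRate(rate float64, bytes uint64, elapsed time.Duration, alpha float64) float64 {
	// Scale the byte count up to a per-second figure. When elapsed is
	// tiny, this multiplier is enormous.
	timeMultiplier := float64(time.Second) / float64(elapsed)
	instant := timeMultiplier * float64(bytes)
	if rate == 0 {
		return instant
	}
	return rate + alpha*(instant-rate)
}

func main() {
	// A 10 MiB message observed over a 1 ns "tick":
	fmt.Println(updateRate(0, 10<<20, time.Nanosecond, 0.1)) // ≈ 1.05e16 bytes/s
}
```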
If possible (if this is easy to reproduce), could you try building with: libp2p/go-flow-metrics#8?
We could also have an overflow/underflow bug somewhere but I'm not seeing it.
@Stebalien thanks for your response. I'm skeptical that this is purely due to a short sample time. Let's suppose that we recorded bandwidth over a period of 1 nanosecond (which is the shortest possible given the nanosecond precision of Go's time package). Even then, a reported rate of 9 exabytes/second would require roughly 9 GB of data to arrive within that single nanosecond, which clearly didn't happen.
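As a quick back-of-the-envelope check of that claim (plain arithmetic, nothing libp2p-specific):

```go
package main

import "fmt"

func main() {
	const reportedRate = 9e18 // ~9 exabytes per second, as reported
	const window = 1e-9       // a 1 nanosecond sample window, in seconds
	// Bytes that would have to arrive inside that window to justify the rate.
	fmt.Printf("%.1f GB\n", reportedRate*window/1e9) // prints "9.0 GB"
}
```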
In terms of reproducibility, since we don't know exactly what is causing the issue it can take some time to reproduce. The spikes are consistent but not very frequent. The graphs I shared are spikes occurring across the entire network (about 3-5 spikes per day on average). But if you just look at a single node, we're seeing one spike every 2-5 days. Once you have a new version of go-flow-metrics you'd like us to test, expect a similar delay before we can confirm whether it helps.
You're right. I was calculating in tebibytes, not exbibytes.
You can see our code that logs bandwidth usage every 5 minutes, and the separate code that bans peers which have high bandwidth usage. In both cases the code is really straightforward and we aren't doing any additional math or processing on top of what we get from BandwidthCounter.

These logs get converted to JSON and sent to Elasticsearch via Fluent Bit. The screenshots I shared are from Kibana, which we use to visualize the data in Elasticsearch. One thing you just made me realize is that JSON cannot represent extraordinarily large numbers accurately. While that is true, I don't think it completely explains the behavior we're seeing. Moreover, we can see that some peers are actually getting banned as a result of these spikes. That indicates that the numbers inside of Go are extraordinarily high, independent of the limits of the JSON format or hypothetical issues with our ELK stack.
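For reference, the shape of that check-and-ban loop is roughly the sketch below. It is a simplified illustration rather than the actual Mesh code: the `banPeer` callback, the limit, and the logging are placeholders, while `metrics.BandwidthCounter` and `GetBandwidthByPeer` are the real go-libp2p API.

```go
package main

import (
	"log"
	"time"

	"github.com/libp2p/go-libp2p-core/metrics"
	"github.com/libp2p/go-libp2p-core/peer"
)

// maxRateIn is a placeholder limit, not the value Mesh actually uses.
const maxRateIn = 5 << 20 // 5 MiB/s

// checkBandwidth logs per-peer usage and bans peers over the limit.
// banPeer stands in for whatever banning mechanism the host uses.
func checkBandwidth(bwc *metrics.BandwidthCounter, banPeer func(peer.ID)) {
	for p, stats := range bwc.GetBandwidthByPeer() {
		log.Printf("peer=%s rateIn=%.0f totalIn=%d", p, stats.RateIn, stats.TotalIn)
		if stats.RateIn > maxRateIn {
			banPeer(p)
		}
	}
}

func main() {
	// The counter would normally be handed to libp2p via the
	// libp2p.BandwidthReporter option so the swarm feeds it data.
	bwc := metrics.NewBandwidthCounter()
	ban := func(p peer.ID) { log.Printf("banning %s", p) }
	for range time.Tick(5 * time.Minute) {
		checkBandwidth(bwc, ban)
	}
}
```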
@Stebalien do you still want me to try out libp2p/go-flow-metrics#8? I'm not sure if we got consensus on that fix but it might still be worth a try.
Actually... are you doing this from WebAssembly? If the clock isn't monotonic, we could end up measuring a tiny (or even negative) interval between updates and dividing by it, which would blow up the instantaneous rate.

However, we'd still need to push 9 GiB of data for that to happen. On the other hand, that could happen every few days. Still, this is really fishy (unless WebAssembly is giving us a fake time?). I've pushed a fix for time travel to that PR in case you'd like to test it. But I'm far from confident that it's the issue.
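I'm not claiming this is exactly what the PR does, but the kind of "time travel" guard being described looks roughly like this sketch (made-up names, simplified logic):

```go
package flowguard

import "time"

// sample is a sketch of a defensive rate sample: if the clock did not
// advance (or ran backwards), skip the update instead of dividing by a
// zero or negative interval.
func sample(lastUpdate time.Time, bytes uint64) (instant float64, now time.Time) {
	now = time.Now()
	tdiff := now.Sub(lastUpdate)
	if tdiff <= 0 {
		return 0, lastUpdate
	}
	timeMultiplier := float64(time.Second) / float64(tdiff)
	return timeMultiplier * float64(bytes), now
}
```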
@Stebalien Good question. None of the results I've shared so far are from WebAssembly. (We actually don't have a way to collect logs from browser nodes at the moment).
Got it. We might try that PR if there is nothing else for us to do. In the meantime, we probably will need to implement a workaround on our end to account for these spikes. It's hard to do because most workarounds we can think of also open doors to potential attackers/spammers.
Damn. Ok. So, Go is usually pretty good about monotonic clocks. I only worried about WebAssembly as it's new and relatively unsupported.
I really can't find the bug. I've combed over github.com/libp2p/go-flow-metrics and can't find anything that could cause this. If you're willing to annotate that code with some debug logging, that would help us figure out what's going on here. Specifically, we'd want to log the intermediate values in the rate calculation (the old and new totals, the time delta, alpha, and the computed instantaneous rate) whenever a reported rate looks abnormally large.
@Stebalien Good idea. Yes, we should be able to add logging for debugging purposes. But keep in mind the feedback loop here will be fairly slow.
Understood. At this point, I really have no idea what's going on.
@Stebalien we have a PR here: 0xProject/0x-mesh#513 which will log everything you asked for (and a bit more) whenever the reported rate is extraordinarily high. As I mentioned earlier, it might take a few days to get results. I'll report anything we find on this thread.
@Stebalien Actually it looks like I already have some useful logs. If I'm interpreting this correctly, the bug is occurring a lot more frequently than we thought. (Perhaps it only seemed rare because, relatively speaking, we don't check bandwidth inside of Mesh very frequently.) I added the following logging code to the sweeper's update loop in go-flow-metrics:

```go
oldRate := m.snapshot.Rate
if m.snapshot.Rate == 0 {
	m.snapshot.Rate = instant
} else {
	m.snapshot.Rate += alpha * (instant - m.snapshot.Rate)
}
if m.snapshot.Rate >= 100000000 {
	// Dump every intermediate value in the rate calculation whenever
	// the EWMA rate crosses an implausibly high threshold.
	logrus.WithFields(logrus.Fields{
		"oldRate":             oldRate,
		"rate":                m.snapshot.Rate,
		"oldTotal":            m.snapshot.Total,
		"newTotal":            total,
		"instant":             instant,
		"diff":                diff,
		"alpha":               alpha,
		"snapshot.LastUpdate": m.snapshot.LastUpdate,
		"sw.LastUpdate":       sw.lastUpdateTime,
		"tdiff":               tdiff,
		"timeMultiplier":      timeMultiplier,
	}).Debug("abnormal rate inside go-flow-metrics")
}
```

Here's a gist containing some recent results. Let me know if I can provide any additional information.
Ok, so we're resetting the total to 0 for some reason. |
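If the accumulator really is being zeroed while the snapshot still remembers the old total, and the difference between them is taken with unsigned arithmetic, the subtraction would wrap around to something on the order of 2^64 bytes, which is the same exabyte ballpark as the reported spikes. A minimal illustration of that wrap-around (generic Go, not the go-flow-metrics code):

```go
package main

import "fmt"

func main() {
	var accumulator uint64 = 0          // accumulator reset to zero
	var snapshotTotal uint64 = 10 << 20 // snapshot still holds the old 10 MiB total
	diff := accumulator - snapshotTotal // unsigned subtraction wraps around
	fmt.Println(diff)                   // ≈ 1.8e19, an absurdly large "byte count"
}
```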
The bug happens here:
We do reset the accumulator to 0 when we mark the meter as "idle", but then we go through a very careful dance to unregister old meters. @vyzo, could you go over the code in sweeper.go?
The logic is not so simple, and there is suspicious floating point arithmetic involved.
So possibly an issue is the subtraction of …
Ah, it happens at the bottom of the loop; nvm.
So the issue seems to be here: https://github.com/libp2p/go-flow-metrics/blob/master/sweeper.go#L151
We copy a meter from the back that hasn't been visited yet and could potentially be an idle meter.
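For anyone following along, the hazard with swap-with-last removal inside a forward loop is the generic one sketched below (a simplified stand-in, not the go-flow-metrics code): the element copied in from the back has not been examined yet on this pass, so if it also needs handling it can be missed.

```go
package main

import "fmt"

// removeIdle deletes "idle" entries by swapping in the last element.
// Because the swapped-in element is not re-checked at index i, an idle
// entry pulled from the back can survive the sweep.
func removeIdle(xs []int, idle func(int) bool) []int {
	for i := 0; i < len(xs); i++ {
		if !idle(xs[i]) {
			continue
		}
		xs[i] = xs[len(xs)-1] // copy an element from the back that hasn't been visited
		xs = xs[:len(xs)-1]
	}
	return xs
}

func main() {
	xs := []int{1, 0, 2, 0} // treat 0 as "idle"
	fmt.Println(removeIdle(xs, func(v int) bool { return v == 0 })) // [1 0 2]
}
```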
libp2p/go-flow-metrics#11 fixes the issue discovered, and hopefully squashes the bug.
The bug might be worse than we thought. In my attempt to make a regression test in libp2p/go-flow-metrics#12 I didn't trigger the spike, but got mangled totals for the affected meters.
We recently added low-level rate-limiting in 0x Mesh. It works by using a BandwidthCounter to periodically check the incoming bandwidth for each peer and then banning peers who exceed the bandwidth limit. We are also piping logs to our Elasticsearch stack so we can monitor bandwidth usage.

The logs are showing occasional massive spikes in bandwidth usage, far higher than what is reasonable/possible. In one case BandwidthCounter reported an incoming bandwidth of 9 exabytes/second from a single peer, which unfortunately caused that peer to be temporarily banned 😦. There is no way we are actually using that much bandwidth on commodity cloud hosting services like AWS or DigitalOcean, even over a very short time interval. I think the only reasonable explanation is a bug in BandwidthCounter.

I'll attach some screenshots from our logging infrastructure. Let me know if any additional information could be helpful.
This is what normal bandwidth usage looks like (incoming bandwidth maxes out at around 1.5MB/s total across all peers):