-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bitswap: better diagnostics and observability on client #209
Comments
I don't know if that what you are asking for, but on the message tracking you can register for just the client: https://github.com/ipfs/go-libipfs/blob/e45eb146ba119528076276b4efb4f6123786eb29/bitswap/client/client.go#L72-L76 |
@hannahhoward : ACK on the request for better observability. If I understand correctly, message tracking on the client above keeps you unblocked for Rhea. Please share if that isn't the case. Also, I'm happy to get a synchronous meeting setup to figure out how to solve specific blocking usecases, For things like "FindProvider calls to the DHT and what they returned", I wonder if we should be developing the strategy of external tooling were can do deeper diagnostics (along the lines of IPFS check, vole, etc.). Rewriting the Bitswap client is in the plans for 2023 in light of the operational events and continued pain with this code. Traceability will be an important element. The specifics or timeline haven't been flushed out, but cross-linking to #73 |
I chatted briefly about this recently.
would be enough for useful triage / returned error messages |
@Jorropo @guseggert : I'm not committing to when we will do this work yet (as a lot happening right now before IPFS Thing!) but it will be important/useful to get a sense of:
I'm happy to talk in Slack or sync if we need some more back and forth. I'd like to have a sense of this before 2023-04-03 Rhea meeting. Thank you! |
This came up again this week as something pretty important for rhea: we need visibility into what blocks are coming from a specific peer. |
Ack. Two things here:
|
I have a few questions:
Assuming contextes are correctly passed (I need to test #172 with tracing), this could be done easily by viewing events in the traces. Would that be fine ?
You can wrap the bitswap network object to do this. I don't think it make sense for this to be exposed in the API because If you want thoses errors at the What of the propositions above would solve your problem ?
That is easy to do, this already exists as
This is not a metric bitswap internally keeps track of nor use in the algorithm, it wouldn't be very hard to expose this. |
Hannah's original post mentions "Right now I am just turning on a bunch of debug logs, which is fine but means I can't do anything with those errors in code, or report them for metrics." At the core we'd like to be able to correlate the peer-block-fetch activity back to sessions / requests. I think if instrumentation was through tracing we'd need to set up a custom exporter on each call to retrieve the spans? I don't know if that ends up being practical/efficient - my sense is that the tracing is meant to be sampled when run at scale for efficiency, but i might be wrong.
I think i agree that a decorated context error is a reasonable way to surface what happened in timeout cases.
some visibility into connectability is the main thing here: separating the cases into:
|
We would use This isn't propagated using the context, this is returned, but we wrap This will correctly supports https://pkg.go.dev/context#CancelCauseFunc assuming consumers |
I don't wrapping a context error is the right approach, cause we need metrics information on all requests, not just failures. For example, in the current cost challenges we had around web3.storage, the metric we needed to track is total downloaded from Web3.storage. But there's no public way to know "when I called GetBlock or GetBlocks, and got data successfully, who actually sent me the the block?" -- having in the bitswap code myself, I know this is knowable, it's just not exposed. It would also be great to know-- who did I ask for the block? How many did I send a want-have vs want-block to? Ultimately, Lassie wants to aggregate these values into metrics to surface things like:
If we don't want to modify GetBlock signature, then we could certainly return information through a listener of some sort. I agree that these are slightly different requests than in the original ticket, cause our understanding has evolved over time. One thing that might be useful to look at is the SP Impact section of the Lassie metrics dashboard -- this is what we're able to assemble from GraphSync requests -- what would it take to assemble the same precision of data from Bitswap? But yes, I agree with @BigLep the best thing would be to discuss this synchronously at IPFS Thing. |
I've asked @guseggert and @Jorropo to get time scheduled per https://filecoinproject.slack.com/archives/C03FFEVK30F/p1681771135478849 |
I'm gonna submit a pull request that creates a custom block type which indicates from which peer we downloaded a block and how long did it took from first attempt to successful download because it's not that hard and can get some valuable information. |
2023-06-02 update on where we're at. This issue hasn't been forgotten. Week of 2023-06-05, @Jorropo will be
@Jorropo will do this once he completes the Kubo 0.21 release blocker items on his plate (the first 3 "required" items in ipfs/kubo#9814). |
Notes from 2023-06-20 maintainer conversation: There was a communication gap to to acknowledge and update on the date slippage. What will be delivered: given Lassie having a custom metric setup, a PR won't be made in Lassie. Instead @Jorropo will produce a Prometheus example of bitswap metrics in Kubo. This has been delayed due to fixing goroutine leaks and deadlocks discovered in the latest Kubo 0.21 RC. @Jorropo is doing the Thunderdome testing to complete the 0.21 release (tracking issue). The only other items ahead of doing the bitswap work above is to complete PL EngRes performance review activities. We expect this to be completed by the end of the week (2023-06-23). @Jorropo will update otherwise. |
@Jorropo: please update if I have this incorrect, but my understanding from last week is that this will get done this week (week of 2023-07-10), and will happen after ipfs/kubo#9927. |
Per slack, "boxo 308 and Boxo 413 solves completely different problems". My understanding is that #413 is doing what we said we'd do in this issue here. Anyways, #308 will happen this week given the feedback that that is higher priority. |
When I call GetBlocks on the Bitswap client, I get back a channel and a synchronous error that is almost always nil. The primary way bitswap "errors" is by the context timing out. I don't know if we count find peers with the block, or we couldn't connect to them, or when we did we found they responded with don't have. So I have no way to track "what went wrong". I also don't have much of a way to track what went right -- if I get a block back, I don't know whom I got it from, the latency, the throughput, anything really.
This makes it very hard for consumers of the Bitswap library to measure performance or debug errors. Right now I am just turning on a bunch of debug logs, which is fine but means I can't do anything with those errors in code, or report them for metrics.
I realize as asynchronous code that ultimately talks to a pubsub protocol, returning errors or even channels of errors on GetBlocks may be hard. However, the client should at least be provided hooks so they can see more about why their requests are failing.
This hook seems like a good start: https://github.com/ipfs/go-libipfs/blob/main/bitswap/client/client.go#LL78C8-L78C8
For starters, we could have a hook that offered the same information, but also exposed Haves and Don't Haves.
Additionally though, it would be nice to be able to observe:
Maybe such hooks exist somewhere? At the top level bitswap there is https://github.com/ipfs/go-libipfs/blob/main/bitswap/tracer/tracer.go#L10 but that seems not to exist for just the client? Even the tracer appears to only track successfully sent messages, not errors.
The text was updated successfully, but these errors were encountered: