Include per-provider bitswap interactions in response timing headers #348
Comments
How useful is this kind of information if we only get data on peers up to the point at which someone starts serving us data, when we begin writing the response and emit the headers in their current state? It's not going to be very representative. Or are we just looking for scraps of information about peers over a large enough number of calls to make some kind of inference? Event recording or other instrumentation would be a more reliable mechanism for this, wouldn't it? |
it will still make it visible when a peer breaks / becomes unavailable, because we'll start to see a bunch of connection attempts that don't resolve until the overall request either errors or completes with someone else. |
@willscott you want to note when the actual dial attempt for a peer in Bitswap doesn’t work? I’m not sure that’s currently accessible from within Bitswap, which handles the actual dialing. I guess we could also attempt to dial ourselves and record the result -- if Bitswap had already dialed, our connect attempt would return instantly, and the same would hold the other way around if our dial succeeded first. |
@rvagg assuming this lands on you, you could put dial attempts in https://github.com/filecoin-project/lassie/blob/main/pkg/retriever/bitswaphelpers/indexerrouting.go and then dispatch events accordingly. |
#345 adds something like this, but I think is not what you all care about -- it enables tracking of data received by peer including through bitswap. |
that seems plausible - if slightly inefficient. Two goals:
|
when we say failure case, are we talking about only non-200 responses for Lassie or also premature stream close errors? I assume for non-200 responses, we can say that those we could not get data from are simply the diff between providers found and providers we couldn't connect to. For premature stream close, we presumably received some data from some peers, but eventually hit a block for which no one returned data. Should we simply say we failed with all peers? Or do we need the CID we failed on? |
I believe #345 is ultimately a blocker here, which in turn is blocked on ipfs/boxo#308 |
@hannahhoward @willscott thoughts on the status of this? We could separate it out into two questions: the continued use of server-timings and whether it's really the best tool to achieve the goals here, and the actual collecting of dial attempts within lassie. We could do the latter separately from the former, maybe even record all the failed dial attempts as failed retrieval attempts in our database, although we have yet to see what kind of volume this information involves (I'm concerned it might be way too much information even just storing the successful block retrievals we've enabled with v0.18.0). It might also be the case that pre-dialing the peers before handing off to bitswap could get us past some of the problems we have with bitswap that are currently requiring us to pin to a particular boxo commit. |
With #332 we have response headers on a per-peer basis.
We should make sure these also include attempts / observed bitswap peers as we get visibility into those connections.
In particular, if we get back bitswap peers from IPNI but fail to connect to them, the response headers should indicate that, and which peers couldn't be contacted versus could be contacted but didn't serve data.