Include per-provider bitswap interactions in response timing headers #348

Open
willscott opened this issue Jul 6, 2023 · 9 comments
Labels
rhea Related to project Rhea saturn L1-node Related to filecoin-saturn L1-Nodes

@willscott
Contributor

With #332 we have response headers on a per-peer basis.
We should make sure these headers also include attempted and observed bitswap peers as we gain visibility into those connections.

In particular, if we get back bitswap peers from IPNI but fail to connect to them, the response headers should indicate that: which peers couldn't be contacted, and which could be contacted but didn't serve data.
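
As a rough illustration of what per-provider entries might look like, here is a minimal Go sketch that renders one `Server-Timing` metric per candidate provider. The `ProviderOutcome` type, its status strings, and the `p0`, `p1` metric names are assumptions made for this example; they are not lassie's actual types or the header layout introduced in #332.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// ProviderOutcome is a hypothetical summary of what happened with one
// candidate provider during a retrieval. It is not a lassie type.
type ProviderOutcome struct {
	PeerID   string        // provider peer ID, as a string
	Protocol string        // e.g. "bitswap"
	Status   string        // assumed values: "dial-failed", "no-data", "served"
	Elapsed  time.Duration // time spent on this provider
}

// ServerTimingValue renders one Server-Timing metric per provider, each of
// the form name;dur=<milliseconds>;desc="<text>".
func ServerTimingValue(outcomes []ProviderOutcome) string {
	metrics := make([]string, 0, len(outcomes))
	for i, o := range outcomes {
		// Metric names must be HTTP tokens, so use a simple indexed name and
		// carry the protocol, peer ID and status in the quoted description.
		// %q is assumed safe here because the fields are plain ASCII.
		desc := fmt.Sprintf("%s %s %s", o.Protocol, o.PeerID, o.Status)
		ms := float64(o.Elapsed) / float64(time.Millisecond)
		metrics = append(metrics, fmt.Sprintf("p%d;dur=%.1f;desc=%q", i, ms, desc))
	}
	return strings.Join(metrics, ", ")
}

func main() {
	outcomes := []ProviderOutcome{
		{PeerID: "peerA", Protocol: "bitswap", Status: "dial-failed", Elapsed: 1500 * time.Millisecond},
		{PeerID: "peerB", Protocol: "bitswap", Status: "served", Elapsed: 20 * time.Millisecond},
	}
	fmt.Println("Server-Timing: " + ServerTimingValue(outcomes))
}
```

With something like this, a peer that was returned by IPNI but never connected would still show up in the header with a `dial-failed` status, rather than being absent.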

@rvagg
Member

rvagg commented Jul 7, 2023

How useful is this kind of information if we only get back data on peers up until the point at which someone starts serving us data and therefore we write data out and spit out the headers in their current state? It's not going to be very representative, or are we just looking for scraps of information about peers over a large enough number of calls to make some kind of inference?

Event recording or other instrumentation would be a more reliable mechanism for this, wouldn't it?

@willscott
Contributor Author

It will still make it visible when a peer breaks or becomes unavailable, because we'll start to see a bunch of connection attempts that don't resolve until the overall request either errors or completes with someone else.

@hannahhoward
Collaborator

@willscott you want to note when the actual dial attempt for a peer in Bitswap doesn't work? I'm not sure that's currently accessible from within Bitswap, which handles the actual dialing. I guess we could also attempt to dial ourselves and record the result: if Bitswap had already dialed, our connect attempt would complete instantaneously, and likewise Bitswap's attempt would if our dial succeeded first.
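
A minimal sketch of the "dial ourselves and record the result" idea, in Go. The `Connector` interface is a hypothetical stand-in for whatever actually performs the dial (in libp2p that role would presumably be played by the host's `Connect` method, which takes a `peer.AddrInfo` rather than a string); `DialResult`, `ProbeDial`, `ProbeAll`, and `fakeConnector` are likewise names invented for this example.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Connector is a hypothetical abstraction over the component that dials peers.
type Connector interface {
	Connect(ctx context.Context, peerID string) error
}

// DialResult records the outcome of one dial attempt.
type DialResult struct {
	PeerID  string
	Elapsed time.Duration
	Err     error // nil if the peer was reachable
}

// ProbeDial attempts a connection with a timeout and records how long it
// took. If a connection already exists, Connect is expected to return quickly.
func ProbeDial(ctx context.Context, c Connector, peerID string, timeout time.Duration) DialResult {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	start := time.Now()
	err := c.Connect(ctx, peerID)
	return DialResult{PeerID: peerID, Elapsed: time.Since(start), Err: err}
}

// ProbeAll dials each peer in turn using a background context.
func ProbeAll(c Connector, peers []string, timeout time.Duration) []DialResult {
	results := make([]DialResult, 0, len(peers))
	for _, p := range peers {
		results = append(results, ProbeDial(context.Background(), c, p, timeout))
	}
	return results
}

// fakeConnector fails for peers listed in down; for demonstration only.
type fakeConnector struct{ down map[string]bool }

func (f fakeConnector) Connect(ctx context.Context, peerID string) error {
	if f.down[peerID] {
		return errors.New("dial failed")
	}
	return nil
}

func main() {
	c := fakeConnector{down: map[string]bool{"peerB": true}}
	for _, r := range ProbeAll(c, []string{"peerA", "peerB"}, time.Second) {
		fmt.Printf("%s reachable=%t\n", r.PeerID, r.Err == nil)
	}
}
```

The results could then be dispatched as events or folded into the response headers; which of those is appropriate is the open question in this thread.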

@hannahhoward
Collaborator

@rvagg assuming this lands on you, you could put dial attempts in https://github.com/filecoin-project/lassie/blob/main/pkg/retriever/bitswaphelpers/indexerrouting.go and then dispatch events accordingly.

@hannahhoward
Collaborator

#345 adds something like this, but I think it is not what you all care about: it enables tracking of data received per peer, including through bitswap.

@hannahhoward hannahhoward added saturn L1-node Related to filecoin-saturn L1-Nodes rhea Related to project Rhea labels Jul 13, 2023
@willscott
Contributor Author

That seems plausible, if slightly inefficient.

Two goals:

  1. In a failure case where saturn needs to render an error page, we want to say:
     • The content was found on providers <a, b, c>
     • Providers <a, b> could not be connected to
     • Provider <c> was connected to but didn't provide the data.
  2. As a trailer that is just consumed on the saturn L1, there's a desire to know how upstream data was obtained and from which peer IDs.
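
To make the first goal concrete, here is a small Go sketch of how the three statements could be rendered once the provider lists are known. `RetrievalReport` and `ErrorSummary` are hypothetical names for this example and the wording of the output lines is an assumption, not an agreed format.

```go
package main

import (
	"fmt"
	"strings"
)

// RetrievalReport is a hypothetical container for the three provider lists
// an error page would need. It is not a lassie or saturn type.
type RetrievalReport struct {
	Found       []string // providers returned by the indexer
	Unreachable []string // providers we could not connect to
	NoData      []string // providers we connected to that served no data
}

// ErrorSummary renders one line per category, in the order listed above.
func (r RetrievalReport) ErrorSummary() string {
	var b strings.Builder
	fmt.Fprintf(&b, "Found on providers: %s\n", strings.Join(r.Found, ", "))
	fmt.Fprintf(&b, "Could not connect to: %s\n", strings.Join(r.Unreachable, ", "))
	fmt.Fprintf(&b, "Connected but no data from: %s\n", strings.Join(r.NoData, ", "))
	return b.String()
}

func main() {
	r := RetrievalReport{
		Found:       []string{"a", "b", "c"},
		Unreachable: []string{"a", "b"},
		NoData:      []string{"c"},
	}
	fmt.Print(r.ErrorSummary())
}
```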

@hannahhoward
Collaborator

hannahhoward commented Jul 19, 2023

When we say failure case, are we talking about only non-200 responses from Lassie, or also premature stream close errors?

I assume for non-200 responses, we can say that the providers we connected to but could not get data from are simply the diff between the providers found and the providers we couldn't connect to.

For premature stream close, we received some data from some peers presumably -- but eventually found a block where no data was returned from anyone. Should we simply say we failed with all peers? Or maybe do we need the CID we failed on?
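
The diff described for the non-200 case can be sketched in Go as a plain set difference. `ConnectedButNoData` is a hypothetical helper name; the sketch assumes no provider served any data, so it does not cover the premature stream close case, where some peers did serve blocks before the failure.

```go
package main

import "fmt"

// ConnectedButNoData returns the providers in found that are not in
// unreachable, preserving the order of found. It assumes no provider
// served data, as in a non-200 response.
func ConnectedButNoData(found, unreachable []string) []string {
	skip := make(map[string]struct{}, len(unreachable))
	for _, p := range unreachable {
		skip[p] = struct{}{}
	}
	out := make([]string, 0, len(found))
	for _, p := range found {
		if _, failedDial := skip[p]; !failedDial {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	found := []string{"a", "b", "c"}
	unreachable := []string{"a", "b"}
	fmt.Println(ConnectedButNoData(found, unreachable)) // prints [c]
}
```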

@hannahhoward
Collaborator

I believe #345 is ultimately a blocker here, which in turn is blocked on ipfs/boxo#308

@rvagg
Member

rvagg commented Sep 21, 2023

@hannahhoward @willscott thoughts on the status of this? We could separate it out into two questions: the continued use of server-timings and whether it's really the best tool to achieve the goals here, and the actual collecting of dial attempts within lassie.

We could do the latter separately from the former, and maybe even record all the failed dial attempts as failed retrieval attempts in our database, although we're yet to see what kind of volume this information involves (I'm concerned it might be way too much information even just storing the successful block retrievals we've enabled with v0.18.0).

It might also be the case that pre-dialing the peers before handing off to bitswap could get us past some of the problems we have with bitswap that are currently requiring us to pin to a particular boxo commit.

3 participants