
Update to use delegated-routing for querying storetheindex #162

Closed
9 tasks done
Tracked by #8775
BigLep opened this issue Mar 9, 2022 · 19 comments

@BigLep

BigLep commented Mar 9, 2022

Done Criteria

Updated 2022-08-11 to capture the latest state:

  • Hydras in production across the whole fleet query storetheindex using reframe rather than the storetheindex provider that was added in "Add a storetheindex delegated provider" (#158)
  • The custom storetheindex code in libp2p/hydra-booster is removed and deployed to production.
  • Hydra dashboards have metrics for their calls to storetheindex. We can answer these questions (see the metrics sketch after this list):
    • Number of calls Hydra made to STI (whether successful or not)
    • Number of calls where Hydra got a 2xx (success) response from STI (whether or not STI has providers for the given CID)
    • Number of calls that hit a fatal server error (e.g., 5xx due to a server issue)
    • Number of calls where the client timed out (and thus didn't get a server response)
    • Distribution of 2xx response payload sizes (in terms of number of records). For each 2xx response, we should accumulate a metric for the number of providers in the response. This allows us to say that the p90 of responses have X providers.
    • Latency of each request, broken out by status code.
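
A minimal sketch of how these metrics could be declared, assuming prometheus/client_golang; hydra-booster's actual metrics stack and metric names may differ, so treat everything below as illustrative:

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
)

var (
	// Total calls to STI, labeled by outcome: "2xx", "5xx", "timeout", "other".
	stiRequests = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "sti_requests_total",
		Help: "Calls Hydra made to storetheindex, by outcome.",
	}, []string{"outcome"})

	// Per-request latency, broken out by status code.
	stiLatency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "sti_request_duration_seconds",
		Help:    "Latency of each storetheindex request, by status code.",
		Buckets: prometheus.DefBuckets,
	}, []string{"code"})

	// Providers per 2xx response, so percentiles (e.g. p90) can be derived.
	stiProviders = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "sti_response_providers",
		Help:    "Number of provider records per successful response.",
		Buckets: prometheus.ExponentialBuckets(1, 2, 10),
	})
)

func init() {
	prometheus.MustRegister(stiRequests, stiLatency, stiProviders)
}
```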

Why Important

Provides first production validation of delegated routing, giving us the confidence to add it to Kubo as part of ipfs/kubo#8775

Notes

  1. We will use the ipld/edelweiss-generated version of ipfs/go-delegated-routing, happening in "Implementation of delegated routing based on the Edelweiss compiler" (ipfs/go-delegated-routing#11).
  2. This depends on storetheindex exposing a delegated-routing endpoint, happening in "Add Reframe protocol server" (ipni/storetheindex#251).
  3. https://protocollabs.grafana.net/d/lNGQTv9Zz/hydra-boosters?orgId=1&var-flight=hydra-test-.* is the Hydra dashboard that should be updated.
  4. What status code storetheindex should return when it has 0 results for a given CID is an open spec item being clarified in "Clarify HTTP response status codes in reframe/REFRAME_HTTP_TRANSPORT.md for empty results" (ipfs/specs#308).
  5. This is "Stage 0" in https://www.notion.so/pl-strflt/Indexer-IPFS-Reframe-Q3-Plan-77d0f3fa193b438784d2d85096e315e7
  6. We don't need to include/deploy the latest "read"-related functionality in the reframe spec, including HTTP caching. That will happen separately when "Add cacheable GET endpoint for findProviders" (ipfs/go-delegated-routing#27) completes.

Estimation Notes

2022-08-19 estimates of work remaining:

@petar
Contributor

petar commented May 27, 2022

@BigLep All tasks are done here, specifically:

  • Hydra has metrics for Reframe path
  • STI has metrics for Reframe path
  • All old STI code is removed from Hydra
  • Both Hydra and STI use the newest version of Delegated Routing and Edelweiss
  • We've verified that Hydra and STI talk to each other

The next steps would be:

  • deployment of STI to production (@willscott), then
  • deployment of Hydra to production (@petar)

@BigLep
Author

BigLep commented May 28, 2022

Thanks, @petar. Let's track the storetheindex production deployment in ipni/storetheindex#251. The Hydra production deployment will be tracked here.

To be clear, has https://protocollabs.grafana.net/d/lNGQTv9Zz/hydra-boosters?orgId=1&var-flight=hydra-test- been updated to use the go-delegated-routing metrics rather than the existing sti-provider metrics?

Good stuff - almost there!

@BigLep
Author

BigLep commented Jul 24, 2022

@petar : following up, has https://protocollabs.grafana.net/d/lNGQTv9Zz/hydra-boosters?orgId=1&var-flight=hydra-test- been updated to use the go-delegated-routing metrics rather than the existing sti-provider metrics?

@petar
Contributor

petar commented Jul 25, 2022

> @petar : following up, has https://protocollabs.grafana.net/d/lNGQTv9Zz/hydra-boosters?orgId=1&var-flight=hydra-test- been updated to use the go-delegated-routing metrics rather than the existing sti-provider metrics?

Yes. The delegated routing code replaces the STI code and uses the same metric names. So the dashboard should work unchanged.

@BigLep
Author

BigLep commented Jul 26, 2022

@petar : last thing for closing this out. Has the custom storetheindex code in libp2p/hydra-booster been removed?

@petar
Contributor

petar commented Jul 26, 2022

> @petar : last thing for closing this out. Has the custom storetheindex code in libp2p/hydra-booster been removed?

Yes.

@willscott
Contributor

@thattommyhall I believe you pinged that we're still using the older protocol for all but the test instance - and that's what I see on the dashboard as well.

I didn't see the removal of the HTTP indexer code go by on GitHub, but noting that it means we should probably coordinate a broader deployment of Reframe before we end up with a deployment that doesn't support the current setup.

@BigLep
Author

BigLep commented Aug 12, 2022

Discussion about this effort is currently happening in the #reframe channel: https://filecoinproject.slack.com/archives/C03RQFURZLM

@BigLep
Author

BigLep commented Aug 12, 2022

Per 2022-08-12 verbal conversations, @guseggert is going to drive this effort to close and will consult with @petar as needed.

@guseggert
Contributor

So the logs originally showed that the timeouts occurred while reading the response body:

2022-08-14T09:37:46.803Z	ERROR	service/client/delegatedrouting	proto/proto_edelweiss.go:1234	client received error response (context deadline exceeded (Client.Timeout or context cancellation while reading body))

This morning I deployed 37dda22 to the test flight, which publishes more detailed error metrics for Reframe and also upgrades to a newer go-libp2p and Go 1.18. After deployment, the timeouts basically disappeared; it's been baking for a few hours, and the traffic is at similar levels now, without timeouts. This leads me to believe that the issue is probably that the Hydra node was taking too long to read the response body due to some environmental issue (overload from some other work it was doing). Something in the go-libp2p or Go upgrades might have also alleviated the bottleneck, e.g. the libp2p resource manager. We see similar timeouts in prod with the non-Reframe StoreTheIndex client, although not nearly at the same rate, but the test flight could have gotten unlucky and been placed into a hot partition, so it could still be the same issue.
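
For illustration, a timeout like the one in that log line can be distinguished from other network failures with standard-library checks. A minimal sketch of bucketing errors into the "Net"/"NetTimeout"/"Other" labels that later appear on the dashboard; hydra-booster's real classification may differ:

```go
package main

import (
	"context"
	"errors"
	"net"
)

// classifyErr buckets a request error for metrics. The bucket names mirror
// the dashboard labels discussed below; this is a sketch, not the real code.
func classifyErr(err error) string {
	switch {
	case err == nil:
		return "Success"
	// Since Go 1.16, net/http client timeouts match context.DeadlineExceeded.
	case errors.Is(err, context.DeadlineExceeded):
		return "NetTimeout"
	default:
		var netErr net.Error
		if errors.As(err, &netErr) {
			if netErr.Timeout() {
				return "NetTimeout" // e.g. timed out while reading the body
			}
			return "Net" // non-timeout network error (conn refused, reset, ...)
		}
		return "Other"
	}
}
```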

My next step is to get some metrics into the dashboard on the libp2p Resource Manager to see if it's throttling anything, understand the impact of that on the network, see if we need to tweak limits, etc. I'm guessing that RM is throttling, because the AddProvider rate is much lower while the STI rate is the same.

Enumerating the options to mitigate overloading:

  • Add a rate limiter to cap the rate of AddProvider DHT calls (see the sketch after this list)
    • We should do this regardless of the other options, as it is the only way we can prevent nodes from being overloaded when traffic patterns change
    • Some calls will start failing for the other nodes that are calling AddProvider - what's the impact of this?
    • This might already be happening with the libp2p upgrade and the libp2p resource manager
  • Reduce the number of heads that each node runs
    • This will increase the overall cost, as we'll need to scale up the fleet to accommodate
    • Traffic pattern changes in the network could still cause overloads
  • Do some analysis on the nature of the calls to see if caching could alleviate the load
    • Are there hot CIDs/addrs whose load we could shed with caching?
    • Traffic pattern changes in the network could still cause overloads
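
For the first option, a minimal sketch of what capping AddProvider could look like, using golang.org/x/time/rate; the limits and wiring are placeholders, not hydra-booster's actual code:

```go
package main

import (
	"errors"

	"golang.org/x/time/rate"
)

var errOverloaded = errors.New("addprovider: rate limit exceeded")

// Allow a sustained 100 AddProvider calls/sec with bursts up to 200.
// These numbers are placeholders; real limits would be tuned per node.
var addProviderLimiter = rate.NewLimiter(rate.Limit(100), 200)

// handleAddProvider sheds load instead of queueing: excess calls fail fast,
// which is exactly what raises the "what's the impact on callers?" question.
func handleAddProvider(process func() error) error {
	if !addProviderLimiter.Allow() {
		return errOverloaded
	}
	return process()
}
```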

@guseggert
Contributor

I've integrated Resource Manager, added RM metrics, added them to the dashboard, and tweaked the RM limits to be low enough to throttle spikes but high enough to generally allow most traffic. The test node is now operating at the same capacity as before, but with minimal timeouts. There are still occasional timeouts (about 0.3% of reqs). These are timeouts reading response headers, so this may be a server-side thing, although I will increase the client-side timeout to 3 seconds to allow for e.g. GC to run without causing timeouts (sketch below).
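
The timeout bump itself is just an HTTP-client setting. A sketch, under the assumption that the Reframe client accepts a custom *http.Client (the exact go-delegated-routing option for this may differ):

```go
package main

import (
	"net/http"
	"time"
)

// newReframeHTTPClient builds the client with the 3-second timeout described
// above: long enough to ride out server-side pauses (e.g. GC), short enough
// that requests don't hang indefinitely.
func newReframeHTTPClient() *http.Client {
	return &http.Client{Timeout: 3 * time.Second}
}
```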

I'm working on this branch: https://github.com/libp2p/hydra-booster/tree/feat/reframe-metrics

I'll get a PR worked up, and continue to let this bake today. If it looks okay tomorrow morning, I'll roll it out to the rest of the fleet.

@BigLep
Author

BigLep commented Aug 19, 2022

2022-08-19 conversation:

@BigLep
Author

BigLep commented Aug 19, 2022

@guseggert : other thoughts from looking at this afterwards:

  1. I worry that it isn't going to be clear to anyone looking at "StoreTheIndex Requests / Sec" what "Net", "NetTimeout", and "Other" mean. Can we maybe add an "info panel" (assuming something like that exists) with an explainer note and a link to canonical information?
  2. Please handle "Clarify HTTP response status codes in reframe/REFRAME_HTTP_TRANSPORT.md for empty results" (ipfs/specs#308) and ensure go-delegated-routing is doing the right thing.
  3. For the latency metrics, do we have other values besides the average? For example, I'm curious what the p99/p100 is for "success".
  4. Did we do this from the done criteria: "Distribution of 2xx response payload sizes (in terms of number of records). For each 2xx response, we should accumulate a metric for the number of providers in the response. This allows us to say that the p90 of responses have X providers."

[screenshot: Hydra dashboard panels]

@guseggert
Contributor

1) is done.

I did 2) a couple weeks ago, see ipfs/specs@0b123fa

3) done in 6967a65

4) done in d48d898

@guseggert
Contributor

Update: last week I deployed Reframe to the full Hydra fleet, but almost all reqs started timing out, so I rolled it back. I have been debugging with @masih in between traveling.

Yesterday there was an STI event that caused the HTTP endpoint to behave much like the Reframe timeouts, so I'm working with @masih to understand the root cause. If that doesn't rule out Reframe, then I'll wait for the root-cause fix and redeploy to see if it also works for Reframe; if it doesn't, then we'll need to do some request tracing through the infrastructure to see where exactly the timeouts are occurring. This might require adding request IDs to the reqs, passing those through LBs, proxies, etc., and adding them to log messages on the STI side (see the sketch below).
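
A sketch of that request-ID idea, as a client-side http.RoundTripper that tags each outgoing request. The header name (X-Request-ID) and the uuid dependency are assumptions, and the LBs/proxies and STI would also need to forward and log the header:

```go
package main

import (
	"net/http"

	"github.com/google/uuid"
)

// requestIDTransport stamps every outgoing request with a unique ID so it can
// be correlated across LBs, proxies, and STI server logs.
type requestIDTransport struct {
	next http.RoundTripper
}

func (t requestIDTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	// Clone first: RoundTrippers must not mutate the caller's request.
	r := req.Clone(req.Context())
	r.Header.Set("X-Request-ID", uuid.NewString())
	return t.next.RoundTrip(r)
}

func newTracingClient() *http.Client {
	return &http.Client{Transport: requestIDTransport{next: http.DefaultTransport}}
}
```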

@masih
Member

masih commented Sep 7, 2022

All fixes for the storetheindex outage yesterday are now deployed. At this time it is unclear whether those fixes also resolve the timeouts observed when Reframe was deployed. We can try a deployment to find out, if that wouldn't be too disruptive to users should the timeouts persist.

@BigLep
Author

BigLep commented Sep 7, 2022

Thanks for the updates, guys. I'll keep following - let me know if anything is needed.

@guseggert
Contributor

This is now deployed and operational, so closing.

Repository owner moved this from 🏃‍♀️ In Progress to 🎉 Done in IPFS Shipyard Team Sep 15, 2022
@petar
Contributor

petar commented Oct 11, 2022 via email
