
Update to use delegated-routing for querying storetheindex #162

Closed
9 tasks done
Tracked by #8775
BigLep opened this issue Mar 9, 2022 · 19 comments

@BigLep

BigLep commented Mar 9, 2022

Done Criteria

Updated 2022-08-11 to capture the latest state:

  • Hydras in production across the whole fleet query storetheindex using reframe rather than the storetheindex provider that was added in "Add a storetheindex delegated provider" (#158)
  • The custom storetheindex code in libp2p/hydra-booster is removed and deployed to production.
  • Hydra dashboards have metrics for their calls to storetheindex. We can answer these questions (see the metrics sketch after this list):
    • Number of calls Hydra made to STI (whether successful or not)
    • Number of calls where Hydra got a 2xx (success) response from STI (whether or not STI has providers for the given CID)
    • Number of calls that hit a fatal server error (e.g., 5xx due to a server issue)
    • Number of calls where the client timed out (and thus didn't get a server response)
    • Distribution of 2xx response payload sizes (in terms of number of records). For each 2xx response, we should accumulate a metric for the number of providers in the response. This allows us to say that the p90 of responses have X providers.
    • Latency of each request, broken out by status code.
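
A minimal sketch of how these metrics could be declared, assuming prometheus/client_golang; hydra-booster's actual metrics stack and metric names may differ, so treat everything below as illustrative:

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
)

var (
	// Total calls to STI, labeled by outcome: "2xx", "5xx", "timeout", "other".
	stiRequests = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "sti_requests_total",
		Help: "Calls Hydra made to storetheindex, by outcome.",
	}, []string{"outcome"})

	// Per-request latency, broken out by status code.
	stiLatency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "sti_request_duration_seconds",
		Help:    "Latency of each storetheindex request, by status code.",
		Buckets: prometheus.DefBuckets,
	}, []string{"code"})

	// Providers per 2xx response, so percentiles (e.g. p90) can be derived.
	stiProviders = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "sti_response_providers",
		Help:    "Number of provider records per successful response.",
		Buckets: prometheus.ExponentialBuckets(1, 2, 10),
	})
)

func init() {
	prometheus.MustRegister(stiRequests, stiLatency, stiProviders)
}
```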

Why Important

Provides first production validation of delegated routing, giving us the confidence to add it to Kubo as part of ipfs/kubo#8775

Notes

  1. We will use the ipld/edelweiss-generated version of ipfs/go-delegated-routing, happening in "Implementation of delegated routing based on the Edelweiss compiler" (ipfs/go-delegated-routing#11).
  2. This depends on storetheindex exposing a delegated-routing endpoint, happening in "Add Reframe protocol server" (ipni/storetheindex#251).
  3. https://protocollabs.grafana.net/d/lNGQTv9Zz/hydra-boosters?orgId=1&var-flight=hydra-test-.* is the Hydra dashboard that should be updated.
  4. What status code storetheindex should return when it has 0 results for a given CID is an open spec item being clarified in "Clarify HTTP response status codes in reframe/REFRAME_HTTP_TRANSPORT.md for empty results" (ipfs/specs#308).
  5. This is "Stage 0" in https://www.notion.so/pl-strflt/Indexer-IPFS-Reframe-Q3-Plan-77d0f3fa193b438784d2d85096e315e7
  6. We don't need to include/deploy the latest "read"-related functionality in the reframe spec, including HTTP caching. That will happen separately when "Add cacheable GET endpoint for findProviders" (ipfs/go-delegated-routing#27) completes.

Estimation Notes

2022-08-19 estimates of work remaining:

@petar
Contributor

petar commented May 27, 2022

@BigLep All tasks are done here, specifically:

  • Hydra has metrics for Reframe path
  • STI has metrics for Reframe path
  • All old STI code is removed from Hydra
  • Both Hydra and STI use the newest version of Delegated Routing and Edelweiss
  • We've verified that Hydra and STI talk to each other

The next steps would be:

  • deployment of STI to production (@willscott), then
  • deployment of Hydra to production (@petar)

@BigLep
Author

BigLep commented May 28, 2022

Thanks, @petar. Let's track the storetheindex production deployment in ipni/storetheindex#251. The Hydra production deployment will be tracked here.

To be clear, has https://protocollabs.grafana.net/d/lNGQTv9Zz/hydra-boosters?orgId=1&var-flight=hydra-test- been updated to use the go-delegated-routing metrics rather than the existing sti-provider metrics?

Good stuff - almost there!

@BigLep
Author

BigLep commented Jul 24, 2022

@petar : following up, has https://protocollabs.grafana.net/d/lNGQTv9Zz/hydra-boosters?orgId=1&var-flight=hydra-test- been updated to use the go-delegated-routing metrics rather than the existing sti-provider metrics?

@petar
Contributor

petar commented Jul 25, 2022

> @petar : following up, has https://protocollabs.grafana.net/d/lNGQTv9Zz/hydra-boosters?orgId=1&var-flight=hydra-test- been updated to use the go-delegated-routing metrics rather than the existing sti-provider metrics?

Yes. The delegated routing code replaces the STI code and uses the same metric names. So the dashboard should work unchanged.

@BigLep
Author

BigLep commented Jul 26, 2022

@petar : last thing for closing this out. Has the custom storetheindex code in libp2p/hydra-booster been removed?

@petar
Contributor

petar commented Jul 26, 2022

> @petar : last thing for closing this out. Has the custom storetheindex code in libp2p/hydra-booster been removed?

Yes.

@willscott
Contributor

@thattommyhall I believe you pinged that we're still using the older protocol for all but the test instance - and that's what I see on the dashboard as well.

I didn't see the removal of the HTTP indexer code go by on GitHub, but noting that it means we should probably coordinate a broader deployment of Reframe before we end up with a deployment that doesn't support the current setup.

@BigLep
Author

BigLep commented Aug 12, 2022

Discussion about this effort is currently happening in the #reframe channel: https://filecoinproject.slack.com/archives/C03RQFURZLM

@BigLep
Author

BigLep commented Aug 12, 2022

Per 2022-08-12 verbal conversations, @guseggert is going to drive this effort to close and will consult with @petar as needed.

@guseggert
Contributor

So the logs originally showed that the timeouts occurred while reading the response body:

2022-08-14T09:37:46.803Z	ERROR	service/client/delegatedrouting	proto/proto_edelweiss.go:1234	client received error response (context deadline exceeded (Client.Timeout or context cancellation while reading body))

This morning I deployed 37dda22 to the test flight, which publishes more detailed error metrics for Reframe and also upgrades to a newer go-libp2p and Go 1.18. After deployment, the timeouts basically disappeared; it's been baking for a few hours, and the traffic is at similar levels now, without timeouts. This leads me to believe that the issue is probably that the Hydra node was taking too long to read the response body due to some environmental issue (overload from some other work it was doing). Something in the go-libp2p or Go upgrades might have also alleviated the bottleneck, e.g. the libp2p resource manager. We see similar timeouts in prod with the non-Reframe StoreTheIndex client, although not nearly at the same rate, but the test flight could have gotten unlucky and been placed into a hot partition, so it could still be the same issue.
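
For illustration, a timeout like the one in that log line can be distinguished from other network failures with standard-library checks. A minimal sketch of bucketing errors into the "Net"/"NetTimeout"/"Other" labels that later appear on the dashboard; hydra-booster's real classification may differ:

```go
package main

import (
	"context"
	"errors"
	"net"
)

// classifyErr buckets a request error for metrics. The bucket names mirror
// the dashboard labels discussed below; this is a sketch, not the real code.
func classifyErr(err error) string {
	switch {
	case err == nil:
		return "Success"
	// Since Go 1.16, net/http client timeouts match context.DeadlineExceeded.
	case errors.Is(err, context.DeadlineExceeded):
		return "NetTimeout"
	default:
		var netErr net.Error
		if errors.As(err, &netErr) {
			if netErr.Timeout() {
				return "NetTimeout" // e.g. timed out while reading the body
			}
			return "Net" // non-timeout network error (conn refused, reset, ...)
		}
		return "Other"
	}
}
```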

My next step is to get some metrics into the dashboard on the libp2p Resource Manager to see if it's throttling anything, understand the impact of that on the network, see if we need to tweak limits, etc. I'm guessing that RM is throttling, because the AddProvider rate is much lower while the STI rate is the same.

Enumerating the options to mitigate overloading:

  • Add a rate limiter to cap the rate of AddProvider DHT calls (see the sketch after this list)
    • We should do this regardless of the other options, as it is the only way we can prevent nodes from being overloaded when traffic patterns change
    • Some calls will start failing for the other nodes that are calling AddProvider - what's the impact of this?
    • This might already be happening with the libp2p upgrade and the libp2p resource manager
  • Reduce the number of heads that each node runs
    • This will increase the overall cost, as we'll need to scale up the fleet to accommodate
    • Traffic pattern changes in the network could still cause overloads
  • Do some analysis on the nature of the calls to see if caching could alleviate the load
    • Are there hot CIDs/addrs whose load we could shed with caching?
    • Traffic pattern changes in the network could still cause overloads
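
For the first option, a minimal sketch of what capping AddProvider could look like, using golang.org/x/time/rate; the limits and wiring are placeholders, not hydra-booster's actual code:

```go
package main

import (
	"errors"

	"golang.org/x/time/rate"
)

var errOverloaded = errors.New("addprovider: rate limit exceeded")

// Allow a sustained 100 AddProvider calls/sec with bursts up to 200.
// These numbers are placeholders; real limits would be tuned per node.
var addProviderLimiter = rate.NewLimiter(rate.Limit(100), 200)

// handleAddProvider sheds load instead of queueing: excess calls fail fast,
// which is exactly what raises the "what's the impact on callers?" question.
func handleAddProvider(process func() error) error {
	if !addProviderLimiter.Allow() {
		return errOverloaded
	}
	return process()
}
```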

@guseggert
Contributor

I've integrated Resource Manager, added RM metrics, added them to the dashboard, and tweaked the RM limits to be low enough to throttle spikes but high enough to generally allow most traffic. The test node is now operating at the same capacity as before, but with minimal timeouts. There are still occasional timeouts (about 0.3% of reqs). These are timeouts reading response headers, so this may be a server-side thing, although I will increase the client-side timeout to 3 seconds to allow for e.g. GC to run without causing timeouts (sketch below).
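
The timeout bump itself is just an HTTP-client setting. A sketch, under the assumption that the Reframe client accepts a custom *http.Client (the exact go-delegated-routing option for this may differ):

```go
package main

import (
	"net/http"
	"time"
)

// newReframeHTTPClient builds the client with the 3-second timeout described
// above: long enough to ride out server-side pauses (e.g. GC), short enough
// that requests don't hang indefinitely.
func newReframeHTTPClient() *http.Client {
	return &http.Client{Timeout: 3 * time.Second}
}
```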

I'm working on this branch: https://github.com/libp2p/hydra-booster/tree/feat/reframe-metrics

I'll get a PR worked up, and continue to let this bake today. If it looks okay tomorrow morning, I'll roll it out to the rest of the fleet.

@BigLep
Author

BigLep commented Aug 19, 2022

2022-08-19 conversation:

@BigLep
Author

BigLep commented Aug 19, 2022

@guseggert : other thoughts from looking at this afterwards:

  1. I worry that it isn't going to be clear to anyone looking at "StoreTheIndex Requests / Sec" what "Net", "NetTimeout", and "Other" mean. Can we maybe add an "info panel" (assuming something like that exists) with an explainer note and a link to canonical information?
  2. Please handle "Clarify HTTP response status codes in reframe/REFRAME_HTTP_TRANSPORT.md for empty results" (ipfs/specs#308) and ensure go-delegated-routing is doing the right thing.
  3. For the latency metrics, do we have other values besides the average? For example, I'm curious what the p99/p100 is for "success".
  4. Did we do this from the done criteria: "Distribution of 2xx response payload sizes (in terms of number of records). For each 2xx response, we should accumulate a metric for the number of providers in the response. This allows us to say that the p90 of responses have X providers."

[screenshot: Hydra dashboard panels]

@guseggert
Contributor

1) is done.

I did 2) a couple weeks ago, see ipfs/specs@0b123fa

3) done in 6967a65

4) done in d48d898

@guseggert
Contributor

Update: last week I deployed Reframe to the full Hydra fleet, but almost all reqs started timing out, so I rolled it back. I have been debugging with @masih in between traveling.

Yesterday there was an STI event that caused the HTTP endpoint to behave much like the Reframe timeouts, so I'm working with @masih to understand the root cause. If that doesn't rule out Reframe, then I'll wait for the root-cause fix and redeploy to see if it also works for Reframe; if it doesn't, then we'll need to do some request tracing through the infrastructure to see where exactly the timeouts are occurring. This might require adding request IDs to the reqs, passing those through LBs, proxies, etc., and adding them to log messages on the STI side (see the sketch below).
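
A sketch of that request-ID idea, as a client-side http.RoundTripper that tags each outgoing request. The header name (X-Request-ID) and the uuid dependency are assumptions, and the LBs/proxies and STI would also need to forward and log the header:

```go
package main

import (
	"net/http"

	"github.com/google/uuid"
)

// requestIDTransport stamps every outgoing request with a unique ID so it can
// be correlated across LBs, proxies, and STI server logs.
type requestIDTransport struct {
	next http.RoundTripper
}

func (t requestIDTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	// Clone first: RoundTrippers must not mutate the caller's request.
	r := req.Clone(req.Context())
	r.Header.Set("X-Request-ID", uuid.NewString())
	return t.next.RoundTrip(r)
}

func newTracingClient() *http.Client {
	return &http.Client{Transport: requestIDTransport{next: http.DefaultTransport}}
}
```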

@masih
Member

masih commented Sep 7, 2022

All fixes for the storetheindex outage yesterday are now deployed. At this time it is unclear whether those fixes also resolve the timeouts observed when Reframe was deployed. We can try a deployment to find out, if that wouldn't be too disruptive to users should the timeouts persist.

@BigLep
Author

BigLep commented Sep 7, 2022

Thanks for the updates, guys. I'll keep following - let me know if anything is needed.

@guseggert
Contributor

This is now deployed and operational, so closing.

Repository owner moved this from 🏃‍♀️ In Progress to 🎉 Done in IPFS Shipyard Team Sep 15, 2022
@petar
Contributor

petar commented Oct 11, 2022 via email
