
Keep Hydra head peerIDs between restarts #128

Closed
aschmahmann opened this issue Jul 21, 2021 · 3 comments

Comments


aschmahmann commented Jul 21, 2021

There doesn't seem to be a good reason for us to rotate our peerIDs (and therefore locations in the Kademlia keyspace) just because we OOM, update the version we're running, etc.

The negative effects of us rotating our keys are:

  1. If you're running a small number of heads, you're effectively making the records previously stored with you useless, since no one will look for them at your new location in the keyspace.
  2. If you're running many heads, you're invalidating a bunch of people's routing tables, which can make clients less efficient. It'll all work itself out over time, but we might as well be nice.

Solution for always reusing the same balanced IDs in the hydra deployment

The ID generator is a pseudorandom (i.e. seeded) algorithm that generates an infinite sequence of mutually-balanced IDs:
ID0, ID1, ID2, ID3, ...

Any one ID in this sequence is uniquely determined by the seed (which determines the sequence) and its index (i.e. sequence number) in the sequence.

Therefore, to ensure that a collection of Hydra heads (1) have mutually-balanced IDs and (2) always reuse the same IDs after a restart, it suffices to parameterize each of them at execution time with the same seed and with its own index, such that each head has a different index in the space of positive integers.
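
For illustration, here is a minimal Go sketch of the determinism property: the i-th identity is fully derived from (seed, index), so a restarted head given the same pair recovers the same peer ID. The deriveID helper and its key-derivation scheme are hypothetical (not the hydra-booster ID generator) and skip the keyspace-balancing step entirely; they only show that seed plus index is enough to pin down an identity.

package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/binary"
	"fmt"

	"github.com/libp2p/go-libp2p-core/crypto"
	"github.com/libp2p/go-libp2p-core/peer"
)

// deriveID deterministically derives the index-th peer identity from a shared
// seed. Hypothetical helper: it demonstrates determinism only, without any
// balancing of the resulting IDs in the Kademlia keyspace.
func deriveID(seed []byte, index uint64) (peer.ID, error) {
	// Mix seed and index into 32 bytes of deterministic key material.
	idx := make([]byte, 8)
	binary.BigEndian.PutUint64(idx, index)
	material := sha256.Sum256(append(append([]byte{}, seed...), idx...))

	// Ed25519 key generation reads exactly 32 bytes from the reader, so the
	// same material always yields the same key pair and the same peer ID.
	priv, _, err := crypto.GenerateEd25519Key(bytes.NewReader(material[:]))
	if err != nil {
		return "", err
	}
	return peer.IDFromPrivateKey(priv)
}

func main() {
	a, _ := deriveID([]byte("xyz"), 1)
	b, _ := deriveID([]byte("xyz"), 1)
	fmt.Println(a, a == b) // same seed and index -> same peer ID after a restart
}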

For example, heads can be parameterized as:
id_seed=xyz, id_index=1
id_seed=xyz, id_index=2
id_seed=xyz, id_index=3
...

Note that it is irrelevant which machines or processes the heads run on.
The key requirement is that each head (across the entire fleet) gets a unique index!

Therefore, heads should be parameterized at the infra/deployment level, perhaps using command-line arguments. Restarting a head then guarantees that it reuses the same ID and that the ID is unique across the fleet.
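
As a sketch of that deployment-level parameterization (the flag names here are illustrative, not the actual hydra-booster CLI), each head could be started with the shared seed and its own index and derive its identity with a helper like the deriveID sketched above:

package main

import (
	"flag"
	"fmt"
	"log"
)

func main() {
	// Illustrative flag names; the real hydra-booster options may differ.
	seed := flag.String("id-seed", "", "seed shared by every head in the fleet")
	index := flag.Uint64("id-index", 0, "this head's unique index in the ID sequence")
	flag.Parse()

	if *seed == "" {
		log.Fatal("an id-seed shared across the fleet is required")
	}

	// deriveID is the hypothetical helper from the sketch above (seed + index -> peer ID).
	id, err := deriveID([]byte(*seed), *index)
	if err != nil {
		log.Fatal(err)
	}
	// Every restart with the same flags brings the head back with the same peer ID.
	fmt.Printf("head %d uses peer ID %s\n", *index, id)
}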

Furthermore, this methodology enables easy (auto)scaling: Just assign unused index numbers to heads that are being added. (The space of positive integers is large enough!)

There is no requirement that the indices are consecutive numbers, only that they are unique. This lets ops engineers use different blocks of integers for different scaling purposes. For example, two entirely independent hydra fleets (with no coordination between them) can be deployed: the first fleet uses only even index numbers for its heads, while the second uses only odd ones. This example clearly generalizes in various ways.

Note that this methodology completely alleviates the need for any kind of direct network coordination/connection between heads, making the system considerably more robust!

Progress

A first step in this direction is provided in #130.

@aschmahmann added the effort/days, exp/expert, kind/enhancement, and need/triage labels on Jul 21, 2021

dennis-tra commented Jul 23, 2021

Hi @aschmahmann,

For the past ~2 weeks I have been running my crawler continuously and noticed a few things that could be of interest for this issue.

  1. There is a significant fraction of provider records in the DHT that yield "Peer ID mismatch" errors when the crawler tries to connect. Correct me if I'm wrong: this can only happen if the PeerID was rotated while retaining the same host/port combination (or multi-address in general). In the screenshot below these errors constitute 13% of all connection errors.
    [screenshot: breakdown of connection errors]
    The day before yesterday the ratio was 26%. My data indicates that something was going on (a deployment?) with the hydra boosters around midnight on 22.07, but that's not relevant here I guess.

  2. The list below shows the redacted top IP addresses and their corresponding numbers of distinct peer IDs. So, an IPFS host at the first IP address in that list was online with over 5,000 different peer IDs over the course of ~6 days. Running ipfs swarm peers | grep <ip address> and then ipfs id <PeerID> returned the hydra agent version for the top 8 IPs.

        maddrs     | count
-------------------+-------
 /ip4/138.xx.xx.xx |  5194
 /ip4/138.xx.xx.xx |  4951
 /ip4/165.xx.xx.xx |  4822
 /ip4/138.xx.xx.xx |  4487
 /ip4/138.xx.xx.xx |  4197
 /ip4/138.xx.xx.xx |  4005
 /ip4/165.xx.xx.xx |  3872
 /ip4/165.xx.xx.xx |  3786
 /ip4/138.xx.xx.xx |  3134
 /ip4/138.xx.xx.xx |  2945
 /ip4/138.xx.xx.xx |  1627
 /ip4/138.xx.xx.xx |  1627
 /ip4/159.xx.xx.xx |   223
  3. There is a huge number of undialable peers in the DHT:
    [screenshot: undialable peers in the DHT]
    This could be partly (13–26%) related to the hydra nodes rotating their PeerIDs, as the records will stay in the DHT for up to 24 h.


petar commented Jul 23, 2021

Yep. This confirms our observation that in the past 2 weeks hydras were restarting constantly and therefore rotating their IDs.

@aschmahmann

closed by #130
