[2021 Theme Proposal] Scalable decentralized content routing #76
Comments
What if the DHT is inherently unscalable? Maybe an investigation of these ideas about incentivized supernodes that provide indexes will prove more fruitful for the future workability of the project.
@fiatjaf I see no evidence that a scalable DHT is impossible. Not much work has been put into the IPFS version yet, and there are some quite promising papers. That said, you are right that there should possibly also be a "plan B" track that tries to come up with a pragmatic mechanism to speed up content discovery in case one big DHT for all the world's data is not viable. This could be some kind of supernodes that are incentivized to keep an index, as also suggested by @jbenet at some point.
👍. However, I think the issue is more about scalable and composable content routing than about scaling the public DHT.
TL;DR: the DHT today is generally pretty snappy at finding content; what's more problematic is publishing. I'm pretty sure there is some quite achievable work there.
However, this still doesn't answer some questions around what happens when people just have a lot of content to publish. If Google decided they wanted to make the hundreds of billions of web pages they have indexed available over IPFS, providing them in the public DHT would seem a little crazy; even if they were only publishing the root CIDs, that's still a lot of data they'd be dumping into the network. In fact, a content provider that big, with that much data, is already a piece of centralized infrastructure. So perhaps for large centralized content providers it's actually more efficient for users to ping some more stable service, whether decentralized, federated, or centralized (e.g. so Google doesn't have to republish their provider records regularly and can just keep them stable for a long time). This is one of the reasons I think composability and ease of extensibility are so important here. Perhaps we can build towards a single content routing system that works well in all use cases, but it seems very unlikely that it would be optimal in all of them.
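To make the composability idea concrete, here is a minimal sketch in Go of a tiered router that asks a stable delegated index first and falls back to a DHT walk. The interface and all type names are illustrative assumptions for this comment, not the actual go-libp2p routing API:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// ContentRouter is an illustrative interface: given a CID, return
// addresses of peers that can serve it.
type ContentRouter interface {
	FindProviders(ctx context.Context, cid string) ([]string, error)
}

// DelegatedRouter asks a stable (possibly centralized or federated)
// endpoint first, e.g. a large provider's own index service.
type DelegatedRouter struct{ Endpoint string }

func (d *DelegatedRouter) FindProviders(ctx context.Context, cid string) ([]string, error) {
	// Real code would issue an HTTP request to d.Endpoint here.
	return nil, errors.New("not indexed here")
}

// DHTRouter stands in for the public Kademlia DHT walk.
type DHTRouter struct{}

func (d *DHTRouter) FindProviders(ctx context.Context, cid string) ([]string, error) {
	return []string{"/p2p/QmSomePeer"}, nil // placeholder result
}

// TieredRouter composes routers: try each in order, return the first hit.
// Other compositions (parallel race, per-namespace) fit the same interface.
type TieredRouter struct{ Routers []ContentRouter }

func (t *TieredRouter) FindProviders(ctx context.Context, cid string) ([]string, error) {
	for _, r := range t.Routers {
		if provs, err := r.FindProviders(ctx, cid); err == nil && len(provs) > 0 {
			return provs, nil
		}
	}
	return nil, errors.New("no providers found")
}

func main() {
	router := &TieredRouter{Routers: []ContentRouter{
		&DelegatedRouter{Endpoint: "https://index.example.com"}, // hypothetical
		&DHTRouter{},
	}}
	provs, err := router.FindProviders(context.Background(), "QmExampleCid")
	fmt.Println(provs, err)
}
```

The point of the shared interface is that the composition strategy can change without touching callers, which is what makes extensibility cheap.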
@aschmahmann you are right that the title of this theme proposal was probably overly constraining. It should be called something like "scalable decentralized content routing", which it now is.

If you look at it from afar, the information that needs to be stored for content discovery is a giant 2D matrix/bitfield with pieces of content on one axis and peers on the other. But this bitfield is extremely sparse and also heavily correlated, so it should be possible to compress it quite well and then somehow shard the compressed info.

An approach that I think is insufficiently explored is to use probabilistic data structures such as Bloom filters or cuckoo filters to store a rough approximation of the CID-to-peer map, and then reserve a slower mechanism for the few-percent failure rate. I talked about this with @whyrusleeping during the last IPFS Camp: basically, have a gossip mechanism for the content of this yet-to-be-defined probabilistic data structure. But I guess this is getting too far into implementation details; the purpose of this ticket should be to discuss whether this should be an area of focus.

Note that I got "decentralized" into the new title so nobody can suggest just having a huge database of CIDs to peers on AWS... :-) That does not mean that all solutions have to be fully decentralized. If it turns out we need supernodes, that is fine as well.
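As a concrete illustration of the probabilistic idea (not a proposal for the actual data structure), here is a self-contained Bloom filter sketch in Go that answers "this peer probably has CID C", with an explicit fallback path for false positives. Sizes, hash choices, and CIDs are arbitrary placeholders:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a minimal Bloom filter over strings: k bit positions per key,
// derived from two base hashes (the Kirsch-Mitzenmacher trick).
type bloom struct {
	bits []uint64
	m    uint64 // number of bits
	k    uint64 // number of hash functions
}

func newBloom(m, k uint64) *bloom {
	return &bloom{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

func (b *bloom) indexes(s string) []uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	h1 := h.Sum64()
	h2 := h1>>33 | h1<<31 // derive a second hash from the first
	idx := make([]uint64, b.k)
	for i := uint64(0); i < b.k; i++ {
		idx[i] = (h1 + i*h2) % b.m
	}
	return idx
}

func (b *bloom) add(s string) {
	for _, i := range b.indexes(s) {
		b.bits[i/64] |= 1 << (i % 64)
	}
}

// mayContain returns false definitively, but true only probabilistically.
func (b *bloom) mayContain(s string) bool {
	for _, i := range b.indexes(s) {
		if b.bits[i/64]&(1<<(i%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	// One filter per peer, gossiped around: "peer P probably has CID C".
	peerFilter := newBloom(1<<20, 7) // ~1M bits, 7 hashes; purely illustrative
	peerFilter.add("QmExampleCid1")

	for _, cid := range []string{"QmExampleCid1", "QmUnknownCid"} {
		if peerFilter.mayContain(cid) {
			fmt.Println(cid, "-> ask this peer first (may be a false positive)")
		} else {
			fmt.Println(cid, "-> definitely not there; fall back to the slower DHT walk")
		}
	}
}
```

Since false negatives are impossible, the filter can only send you on an occasional wasted round trip, and the slow path stays correct; that is what makes the lossy compression acceptable.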
@rklaehn @aschmahmann @fiatjaf great issue and discussion, thank you. I agree totally with everyone's thoughts re scalable/smarter/multi-level DHTs, non-deterministic routing, etc. I have just submitted theme #79 to take them even further, based on our work over the last 10 years on real-time video content distribution via the Internet. We strongly believe that combining our respective technologies will have significant benefit for our projects and for the Internet in general. We submitted a proposal for IPFS R&D RFPs 7 & 8 earlier this year to @daviddias & co, but this may be a much more appropriate forum, given the scale of our proposal. Our thinking has evolved a lot since then, and the market has happily "come to meet us". @jonnycrunch is an adviser to us, and Jaime Llorca is a co-author of our white paper and patents: https://scholar.google.com/citations?user=KSI2DE0AAAAJ&hl=en. Looking forward very much to a robust and very interesting discussion!
Note, this is part of the 2021 IPFS project planning process - feel free to add other potential 2021 themes for the IPFS project by opening a new issue or discuss this proposed theme in the comments, especially other example workstreams that could fit under this theme for 2021. Please also review others’ proposed themes and leave feedback here!
Theme description
The current IPFS DHT has seen some improvements this year, but it is still not capable of handling all the content of the world in a content-addressed way. The focus this year should be on scaling the DHT until it can address petabytes of data at small granularity.
Hypothesis
Growth of IPFS is not hindered by bad user experience or lack of awareness, but by scalability problems.
Whenever there is a discussion about IPFS on popular developer forums like Hacker News, everybody loves the idea and gets why a worldwide content-addressed system would be amazing. But there are always lots of people who say that while they like the idea, they tried it and it did not work well for them.
For myself, I find that the only reliable way to publish a large piece of content such as an IPFS meetup video is to pin it on one of the pinning services like Pinata. Putting it on a small EC2 instance is insufficient, since the content cannot be found via the DHT quickly enough. I am very thankful for Pinata, but this is nevertheless antithetical to the idea of a distributed web.
So my hypothesis is that all that is needed for rapid exponential growth of IPFS is for IPFS to actually be able to handle said explosive growth.
Vision statement
IPFS would truly be capable of handling the world's data. I should be able to put content like video files on the tiniest connected machine possible and still be able to serve an arbitrary number of content consumers.
Why focus this year
With the launch of Filecoin and increasing awareness of the fragility of centralized solutions, there is sufficient awareness of solutions like IPFS. However, it is essential that IPFS is up to the challenge when people with demanding use cases spend a few hours of their time trying it out.
Example workstreams
test implementations of DHT improvements from academia at large scale, using tools like https://docs.testground.ai/
The result of this track would be, at minimum, a paper providing real-world data for various proposed DHT improvements; I am sure this would contain some surprises. At best, a method to drastically improve DHT performance would be identified and validated, so that it could be brought to production quality next year. (A toy version of the kind of measurement involved is sketched below.)
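As an illustration of the kind of data point such a paper would aggregate, a trivial harness can time a provider lookup against a local go-ipfs daemon's HTTP API. This assumes a daemon running on the default API port; the CID is a placeholder, and real measurements would come from many such samples inside Testground runs, not a single local call:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	cid := "QmExampleCid" // replace with a CID actually published to the DHT
	url := "http://127.0.0.1:5001/api/v0/dht/findprovs?arg=" + cid

	start := time.Now()
	resp, err := http.Post(url, "", nil) // go-ipfs API endpoints expect POST
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body) // drain the streamed result until the walk finishes
	fmt.Printf("findprovs for %s took %s\n", cid, time.Since(start))
}
```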
identify and implement easy wins / low-hanging fruit to make the DHT better able to handle the load, before breakthroughs from the academic research track arrive
It should be possible to get significant constant-factor performance improvements just by optimizing the performance and data usage of the current single-level Kademlia DHT; one candidate is sketched below.
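One example of such a constant-factor win (an assumption on my part, not a measured result) is pipelining provider-record publishes instead of issuing them one at a time, since each publish is dominated by a DHT walk that can overlap with others. A sketch with a simulated publish:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// provideOne stands in for publishing a single provider record to the DHT;
// the simulated latency is purely illustrative.
func provideOne(cid string) {
	time.Sleep(50 * time.Millisecond)
}

// provideAll publishes records through a bounded worker pool, overlapping
// the DHT walks for a near-linear speedup up to the concurrency limit.
func provideAll(cids []string, workers int) {
	jobs := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for cid := range jobs {
				provideOne(cid)
			}
		}()
	}
	for _, cid := range cids {
		jobs <- cid
	}
	close(jobs)
	wg.Wait()
}

func main() {
	cids := make([]string, 100)
	for i := range cids {
		cids[i] = fmt.Sprintf("QmFakeCid%d", i)
	}
	start := time.Now()
	provideAll(cids, 16) // 16 workers: ~16x fewer sequential round trips
	fmt.Println("provided", len(cids), "records in", time.Since(start))
}
```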
build an environment based on https://docs.testground.ai/ that allows friendly competition between the existing IPFS implementations, like https://github.com/ipfs/go-ipfs, https://github.com/ipfs/js-ipfs, https://github.com/rs-ipfs/rust-ipfs, and https://github.com/ipfs-rust/ipfs-embed
The goal of this should be to make it extremely easy for even small, unfunded projects to do large-scale performance testing to validate or invalidate performance-improvement approaches at relevant scale.
The end result of this track would be a performance competition at the end of 2021, with significant USD or FIL rewards.
Other content
One thing that is orthogonal to the DHT improvements, but also needs doing to simplify adoption, is coming up with a good API for IPFS. The current REST API was a good first attempt back when @jbenet came up with it many years ago, but it is not a very good REST API: it feels like just a way to expose the go-ipfs CLI commands over HTTP. It also has some serious issues in the area of pinning.
So either embrace REST and come up with a properly RESTful API, or go with something else entirely (JSON-RPC, GraphQL, ...) that is designed from the start to guide users into correct usage.
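As a purely hypothetical illustration of what a resource-oriented pinning endpoint could look like (these routes, types, and status values are made up for the example, not an existing spec), pins would be resources you create and read rather than CLI verbs you invoke:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

// Pin is a hypothetical resource model for a pin.
type Pin struct {
	Cid    string `json:"cid"`
	Status string `json:"status"` // e.g. "queued", "pinned", "failed"
}

var pins = map[string]Pin{} // in-memory store, just for the sketch

func pinsHandler(w http.ResponseWriter, r *http.Request) {
	switch {
	case r.Method == http.MethodPost && r.URL.Path == "/pins":
		// POST /pins creates a pin request; the CID identifies the resource.
		var p Pin
		if err := json.NewDecoder(r.Body).Decode(&p); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		p.Status = "queued"
		pins[p.Cid] = p
		w.WriteHeader(http.StatusCreated)
		json.NewEncoder(w).Encode(p)
	case r.Method == http.MethodGet && strings.HasPrefix(r.URL.Path, "/pins/"):
		// GET /pins/{cid} reads pin state instead of invoking a CLI command.
		cid := strings.TrimPrefix(r.URL.Path, "/pins/")
		if p, ok := pins[cid]; ok {
			json.NewEncoder(w).Encode(p)
			return
		}
		http.NotFound(w, r)
	default:
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
	}
}

func main() {
	http.HandleFunc("/pins", pinsHandler)
	http.HandleFunc("/pins/", pinsHandler)
	fmt.Println("listening on :8080")
	http.ListenAndServe(":8080", nil)
}
```

The design point is that pin state becomes addressable and pollable (and a DELETE on the same path would unpin), which is exactly what the current command-shaped API makes awkward.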