[2021 Theme Proposal] Scalable decentralized content routing #76
Comments
What if the DHT is inherently unscalable? Maybe an investigation of these ideas about incentivized supernodes that provide indexes will prove more fruitful for the future workability of the project.
@fiatjaf I see no evidence that a scalable DHT is impossible. Not much work has been put into the IPFS version yet, and there are some quite promising papers. That said, you are right that there should possibly also be a "plan B" track that tries to come up with a pragmatic mechanism to speed up content discovery in case one big DHT for all the world's data is not viable. This could be some kind of supernodes that are incentivized to keep an index, as also suggested by @jbenet at some point.
👍. However, I think the issue is more about scalable and composable content routing than about scaling the public DHT.
TL;DR: the DHT today is generally pretty snappy at finding content; what's more problematic is publishing. I'm pretty sure there is some quite achievable work there.
However, this still doesn't answer some questions around what happens when people just have a lot of content to publish. If Google decided they wanted to make the hundreds of billions of web pages they have indexed available over IPFS, providing them in the public DHT would seem a little crazy; even if they were only publishing the root CIDs, that's still a lot of data they'd be dumping into the network. In fact, a content provider that big, with that much data, is already a piece of centralized infrastructure. So perhaps for large centralized content providers it's actually more efficient for users to ping some more stable service, whether decentralized, federated, or centralized (e.g. so Google doesn't have to republish their provider records regularly and can just keep them stable for a long time). This is one of the reasons I think composability and ease of extensibility are so important here. Perhaps we can build towards a single content routing system that works well in all use cases, but it seems very unlikely that it would be optimal in all of them.
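To make the composability idea concrete, here is a minimal sketch in Go of a tiered router that asks a stable delegated index first and falls back to a DHT walk. The interface and all type names are illustrative assumptions for this comment, not the actual go-libp2p routing API:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// ContentRouter is an illustrative interface: given a CID, return
// addresses of peers that can serve it.
type ContentRouter interface {
	FindProviders(ctx context.Context, cid string) ([]string, error)
}

// DelegatedRouter asks a stable (possibly centralized or federated)
// endpoint first, e.g. a large provider's own index service.
type DelegatedRouter struct{ Endpoint string }

func (d *DelegatedRouter) FindProviders(ctx context.Context, cid string) ([]string, error) {
	// Real code would issue an HTTP request to d.Endpoint here.
	return nil, errors.New("not indexed here")
}

// DHTRouter stands in for the public Kademlia DHT walk.
type DHTRouter struct{}

func (d *DHTRouter) FindProviders(ctx context.Context, cid string) ([]string, error) {
	return []string{"/p2p/QmSomePeer"}, nil // placeholder result
}

// TieredRouter composes routers: try each in order, return the first hit.
// Other compositions (parallel race, per-namespace) fit the same interface.
type TieredRouter struct{ Routers []ContentRouter }

func (t *TieredRouter) FindProviders(ctx context.Context, cid string) ([]string, error) {
	for _, r := range t.Routers {
		if provs, err := r.FindProviders(ctx, cid); err == nil && len(provs) > 0 {
			return provs, nil
		}
	}
	return nil, errors.New("no providers found")
}

func main() {
	router := &TieredRouter{Routers: []ContentRouter{
		&DelegatedRouter{Endpoint: "https://index.example.com"}, // hypothetical
		&DHTRouter{},
	}}
	provs, err := router.FindProviders(context.Background(), "QmExampleCid")
	fmt.Println(provs, err)
}
```

The point of the shared interface is that the composition strategy can change without touching callers, which is what makes extensibility cheap.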
@aschmahmann you are right that the title of this theme proposal was probably overly constraining. It should be called something like "scalable decentralized content routing", which it now is.

If you look at it from afar, the information that needs to be stored for content discovery is a giant 2D matrix/bitfield with pieces of content on one axis and peers on the other. But this bitfield is extremely sparse and also heavily correlated, so it should be possible to compress it quite well and then somehow shard the compressed info.

An approach that I think is insufficiently explored is to use probabilistic data structures such as Bloom filters or cuckoo filters to store a rough approximation of the CID-to-peer map, and then reserve a slower mechanism for the few-percent failure rate. I talked about this with @whyrusleeping during the last IPFS Camp: basically, have a gossip mechanism for the content of this yet-to-be-defined probabilistic data structure. But I guess this is getting too far into implementation details; the purpose of this ticket should be to discuss whether this should be an area of focus.

Note that I got "decentralized" into the new title so nobody can suggest just having a huge database of CIDs to peers on AWS... :-) That does not mean that all solutions have to be fully decentralized. If it turns out we need supernodes, that is fine as well.
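As a concrete illustration of the probabilistic idea (not a proposal for the actual data structure), here is a self-contained Bloom filter sketch in Go that answers "this peer probably has CID C", with an explicit fallback path for false positives. Sizes, hash choices, and CIDs are arbitrary placeholders:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a minimal Bloom filter over strings: k bit positions per key,
// derived from two base hashes (the Kirsch-Mitzenmacher trick).
type bloom struct {
	bits []uint64
	m    uint64 // number of bits
	k    uint64 // number of hash functions
}

func newBloom(m, k uint64) *bloom {
	return &bloom{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

func (b *bloom) indexes(s string) []uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	h1 := h.Sum64()
	h2 := h1>>33 | h1<<31 // derive a second hash from the first
	idx := make([]uint64, b.k)
	for i := uint64(0); i < b.k; i++ {
		idx[i] = (h1 + i*h2) % b.m
	}
	return idx
}

func (b *bloom) add(s string) {
	for _, i := range b.indexes(s) {
		b.bits[i/64] |= 1 << (i % 64)
	}
}

// mayContain returns false definitively, but true only probabilistically.
func (b *bloom) mayContain(s string) bool {
	for _, i := range b.indexes(s) {
		if b.bits[i/64]&(1<<(i%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	// One filter per peer, gossiped around: "peer P probably has CID C".
	peerFilter := newBloom(1<<20, 7) // ~1M bits, 7 hashes; purely illustrative
	peerFilter.add("QmExampleCid1")

	for _, cid := range []string{"QmExampleCid1", "QmUnknownCid"} {
		if peerFilter.mayContain(cid) {
			fmt.Println(cid, "-> ask this peer first (may be a false positive)")
		} else {
			fmt.Println(cid, "-> definitely not there; fall back to the slower DHT walk")
		}
	}
}
```

Since false negatives are impossible, the filter can only send you on an occasional wasted round trip, and the slow path stays correct; that is what makes the lossy compression acceptable.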
@rklaehn @aschmahmann @fiatjaf great issue and discussion, thank you. I agree totally with everyone's thoughts re scalable/smarter/multi-level DHTs, non-deterministic routing, etc. I have just submitted theme #79 to take them even further, based on our work over the last 10 years on real-time video content distribution via the Internet. We strongly believe that combining our respective technologies will have significant benefit for our projects and for the Internet in general. We submitted a proposal for IPFS R&D RFPs 7 & 8 earlier this year to @daviddias & co, but this may be a much more appropriate forum, given the scale of our proposal. Our thinking has evolved a lot since then, and the market has happily "come to meet us". @jonnycrunch is an adviser to us, and Jaime Llorca is a co-author of our white paper and patents: https://scholar.google.com/citations?user=KSI2DE0AAAAJ&hl=en. Looking forward very much to a robust and very interesting discussion!
Note, this is part of the 2021 IPFS project planning process - feel free to add other potential 2021 themes for the IPFS project by opening a new issue or discuss this proposed theme in the comments, especially other example workstreams that could fit under this theme for 2021. Please also review others’ proposed themes and leave feedback here!
Theme description
The current IPFS DHT has seen some improvements this year, but it is still not capable of handling all the content of the world in a content-addressed way. The focus this year should be on scaling the DHT until it can address petabytes of data at small granularity.
Hypothesis
Growth of IPFS is not hindered by bad user experience or lack of awareness, but by scalability problems.
Whenever there is a discussion about IPFS on popular developer forums like Hacker News, everybody loves the idea and gets why a worldwide content-addressed system would be amazing. But there are always lots of people who say that while they like the idea, they tried it and it did not work well for them.
For myself, I find that the only reliable way to publish a large piece of content such as an IPFS meetup video is to pin it on one of the pinning services like Pinata. Putting it on a small EC2 instance is insufficient, since the content cannot be found via the DHT quickly enough. I am very thankful for Pinata, but this is nevertheless antithetical to the idea of a distributed web.
So my hypothesis is that all that is needed for rapid exponential growth of IPFS is for IPFS to actually be able to handle said explosive growth.
Vision statement
IPFS would truly be capable of handling the world's data. I should be able to put content like video files on the tiniest connected machine possible and still be able to serve an arbitrary number of content consumers.
Why focus this year
With the launch of Filecoin and increasing awareness of the fragility of centralized solutions, there is sufficient awareness of solutions like IPFS. However, it is essential that IPFS is up to the challenge when people with demanding use cases spend a few hours of their time trying it out.
Example workstreams
test implementations of DHT improvements from academia at large scale, using tools like https://docs.testground.ai/
The result of this track would be, at minimum, a paper providing real-world data for various proposed DHT improvements; I am sure this would contain some surprises. At best, a method to drastically improve DHT performance would be identified and validated, so that it could be brought to production quality next year. (A toy version of the kind of measurement involved is sketched below.)
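As an illustration of the kind of data point such a paper would aggregate, a trivial harness can time a provider lookup against a local go-ipfs daemon's HTTP API. This assumes a daemon running on the default API port; the CID is a placeholder, and real measurements would come from many such samples inside Testground runs, not a single local call:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	cid := "QmExampleCid" // replace with a CID actually published to the DHT
	url := "http://127.0.0.1:5001/api/v0/dht/findprovs?arg=" + cid

	start := time.Now()
	resp, err := http.Post(url, "", nil) // go-ipfs API endpoints expect POST
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body) // drain the streamed result until the walk finishes
	fmt.Printf("findprovs for %s took %s\n", cid, time.Since(start))
}
```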
identify and implement easy wins / low-hanging fruit to make the DHT better able to handle the load, before breakthroughs from the academic research track arrive
It should be possible to get significant constant-factor performance improvements just by optimizing the performance and data usage of the current single-level Kademlia DHT; one candidate is sketched below.
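One example of such a constant-factor win (an assumption on my part, not a measured result) is pipelining provider-record publishes instead of issuing them one at a time, since each publish is dominated by a DHT walk that can overlap with others. A sketch with a simulated publish:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// provideOne stands in for publishing a single provider record to the DHT;
// the simulated latency is purely illustrative.
func provideOne(cid string) {
	time.Sleep(50 * time.Millisecond)
}

// provideAll publishes records through a bounded worker pool, overlapping
// the DHT walks for a near-linear speedup up to the concurrency limit.
func provideAll(cids []string, workers int) {
	jobs := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for cid := range jobs {
				provideOne(cid)
			}
		}()
	}
	for _, cid := range cids {
		jobs <- cid
	}
	close(jobs)
	wg.Wait()
}

func main() {
	cids := make([]string, 100)
	for i := range cids {
		cids[i] = fmt.Sprintf("QmFakeCid%d", i)
	}
	start := time.Now()
	provideAll(cids, 16) // 16 workers: ~16x fewer sequential round trips
	fmt.Println("provided", len(cids), "records in", time.Since(start))
}
```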
build an environment based on https://docs.testground.ai/ that allows friendly competition between the existing IPFS implementations, like https://github.com/ipfs/go-ipfs, https://github.com/ipfs/js-ipfs, https://github.com/rs-ipfs/rust-ipfs, and https://github.com/ipfs-rust/ipfs-embed
The goal of this should be to make it extremely easy for even small, unfunded projects to do large-scale performance testing to validate or invalidate performance-improvement approaches at relevant scale.
The end result of this track would be a performance competition at the end of 2021, with significant USD or FIL rewards.
Other content
One thing that is orthogonal to the DHT improvements, but also needs doing to simplify adoption, is coming up with a good API for IPFS. The current REST API was a good first attempt back when @jbenet came up with it many years ago, but it is not a very good REST API: it feels like just a way to expose the go-ipfs CLI commands over HTTP. It also has some serious issues in the area of pinning.
So either embrace REST and come up with a properly RESTful API, or go with something else entirely (JSON-RPC, GraphQL, ...) that is designed from the start to guide users into correct usage.
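As a purely hypothetical illustration of what a resource-oriented pinning endpoint could look like (these routes, types, and status values are made up for the example, not an existing spec), pins would be resources you create and read rather than CLI verbs you invoke:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

// Pin is a hypothetical resource model for a pin.
type Pin struct {
	Cid    string `json:"cid"`
	Status string `json:"status"` // e.g. "queued", "pinned", "failed"
}

var pins = map[string]Pin{} // in-memory store, just for the sketch

func pinsHandler(w http.ResponseWriter, r *http.Request) {
	switch {
	case r.Method == http.MethodPost && r.URL.Path == "/pins":
		// POST /pins creates a pin request; the CID identifies the resource.
		var p Pin
		if err := json.NewDecoder(r.Body).Decode(&p); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		p.Status = "queued"
		pins[p.Cid] = p
		w.WriteHeader(http.StatusCreated)
		json.NewEncoder(w).Encode(p)
	case r.Method == http.MethodGet && strings.HasPrefix(r.URL.Path, "/pins/"):
		// GET /pins/{cid} reads pin state instead of invoking a CLI command.
		cid := strings.TrimPrefix(r.URL.Path, "/pins/")
		if p, ok := pins[cid]; ok {
			json.NewEncoder(w).Encode(p)
			return
		}
		http.NotFound(w, r)
	default:
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
	}
}

func main() {
	http.HandleFunc("/pins", pinsHandler)
	http.HandleFunc("/pins/", pinsHandler)
	fmt.Println("listening on :8080")
	http.ListenAndServe(":8080", nil)
}
```

The design point is that pin state becomes addressable and pollable (and a DELETE on the same path would unpin), which is exactly what the current command-shaped API makes awkward.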