-
Notifications
You must be signed in to change notification settings - Fork 452
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Redesign of the Search/Channels feature #3615
Comments
Mostly duplicate of #2455 |
Frequency distribution of words in all natural languages follow Zipf's law. Speakers generally optimize the term (word, phrase ...) that they assign to the denotate (the physical object/idea ...) for efficiency of communication.
So, the best thing about natural languages is that humans use them to name things, like movies, game, music, torrents... This means that when someone does a search query for a thing he already knows exists, he addresses it by the terms that identify that thing uniquely enough in the database of human knowledge. And when someone just wants something in some genre - he names it by the name of the genre. Everything in between ("that movie with this guy with cool beard") we leave to Google. |
FASD: A Fault-tolerant, Adaptive, Scalable, Distributed Search Engine (2003 master thesis) - almost exactly what I came up with. No works by that author afterward, though 😞 Associative search in peer to peer networks: Harnessing latent semantics (2006) A nice survey on the topic of decentralized search: "Survey of Research towards Robust Peer-to-Peer Networks: Search Methods" Hmmm... This stuff seems to be beaten to death in the first decade of 00`s. Still, no one truly succeeded. |
Design details for the new Channels/AllChannel/Search subsystemClass designAll 3 communities inherit from base Metadata Community, that implements ORM and basic metadata query messages:
The database interface is implemented with a separate MetadataStore Pony ORM-based Tribler module. The base Metadata Community class implements basic messages for asking a remote peer for metadata element based on some criteria. Child communities differ mostly by the types of reaction to metadata queries, thus creating different search domains. The channel data is downloaded in the form of a torrent stuffed with metadata in Tribler Serializer (Python Struct) format. MetadataMetadata entries are encapsulated in signed payloads and stored in a serialized format according to the following scheme:
Basically, the signature and the author's public key form a _ trustworthy gossip container_. Anything could be put in there, but we use it for packing various formats of metadata. The primary metadata format for a torrent is described below:
Notes on format:
Metadata sizeFor each metadata format, there are two versions: full and short.
🐍 AllChannel protocol 🐍It's like a "snakes and ladders" game, channel tags used as "snakes" or "ladders" to other channels. This requires nested channels support. 🐻 Channels protocol 🐻"TasteBuddies" are established according to seeded torrents, seeded metadata directories (channels), search queries (potential privacy data leak?). Special walker regularly queries buddies for metadata updates. 🐦 Search protocol 🐦Search engine will use the Summary prefix tree](https://sci-hub.tw/10.1109/NCA.2017.8171372). The queries could be naturally done in parallel. All the returned metadata entries are saved to the local store. Metadata torrentsMetadata torrent (channel torrent) is formed from concatenated serialized Metatada entries ( Content deletionWhen the user wants to delete a torrent from their channel, a special DELETE entry is added to the channel contents. This entry follows the basic Metadata serialized messages format (signatures, timestamps, etc.), and adds a single |
Summary prefix tree: An over DHT indexing data structure for efficient superset search (Nov 2017) - this is almost exactly what I came up with. The algorithm is able to efficiently look up for queries consisting of supersets of keywords. Its based on a cross between Bloom filters and DHTs. However, the idea is relatively simple: |
level 1: local search architecture is based on 2 components and their connection first release features (strictly limited):
|
Update procedure
Initially, metadata torrent is created by concatenating the list of metadata entries in a serialized form. Updates are stored Git-style: each change is represented as a separate file.
|
All of Allchannels can be removed. If we add to each channel the ability to push 1k content item + channel votes. Votes are simply signed +1s by users. Votes are responsibility of channel. It is incentive compatible to let channels work for their self promotion. Votes for channels are future proof: grow towards votes for tags per torrent, torrent duplicates, torrent bundles, etc. Cool? |
@synctext, the initial idea was exactly that: remove AllChannels completely. The method of dissemination of information about new channels is the Snakes gossip protocol:
It is important to note that this algorithm eventually leads to spreading root torrent collections, since tags point to the parent channel. Opportunistic update of metadataDepending on my relationships with the peer who sent me this MD, if he sent me the outdated MD, I could push to him the newer version of this MD's parent channel. On the other hand, if I receive some MD that refers to a newer version of the channel I already subscribed to, I could decide to do a preemptive update check on that channel, from the same friend. |
@synctext, the question is, where and how do we store the vote info? Of course, vote info could slowly travel the network in the form of a gossip. Trust algorithms ensure that no one would cheat on it. |
Yes, ideas seem to converge. Allchannel now has global broadcast of votes. Gossip of votes and filtering based on trust would be a clear required leap forward.. |
About basics of gossip protocol: If a sybil region votes for the things I like, and this vote help me filter info - I trust this vote. If a proven real person I know votes for things I don't like - I distrust the vote. One good example is "paid bloggers" phenomena: these guys maintain high-profile blogs, but sometimes inject paid material. The people who are OK with it form the blogger's auditory. Those who shun it just don't listen to the guy. We can't create the perfectly objective view of reality for everyone, but we can give a person the tools to make her own view of reality consistent. |
If-Modified-Since primitive for content discovery makes a lot of sense.
The only information you need to download a channel, sample torrents, and check votes is an
Sounds complex... In the past 12 years we have been live we gotten lots of spam. Features have been removed out of production because they where not spam-resilient. Our trust and reputation function are still very simplistic. The above idea does not sound like something that is ready for testing on forum.tribler.org by next release deadline of 1 August.
Those are perfect attack vectors for a denial-of-service attack, right? |
We had a pop-up developer meeting where we briefly discussed the basics and future of the existing AllChannel 2.0 mechanism (see #3489). This comment briefly describes what we came up with. Main goal: simplicity First, we make a distinction between a channel producer (creator) and consumer (a user that browses through the channel). The channel API (for channel consumers):
voting on a channel |
Full Text SearchWe will use FTS4/5 from SQLite. To take word morphology into account we will use Snowball stemmer. We will produce tokens for FTS on our own (because we don't want to develop a wrapper and a binary version of Snowball for SQLite). For the first versions of Chant we will still use Porter stemmer embedded in SQLite. Tags and title terms share the same search space. When added to the local database, the terms obtained from stemming the words of the torrent metadata |
Significant overlap with nearly finished code #3660 |
Mental note... This sprint aim to do a big leap forward for metadata. Similar how IPv8 collapsed the code base dramatically. By trying out a new approach for metadata, we make a solid step forward. Sprint deadline is 1 august 2018, functional release on forum.tribler.org |
Nested channels
Notes on security:
|
For reference, this also overlaps: #3484. |
As the design is becoming more clear, I propose investing some time in a dataset with 1 million real items and performance analysis. Build scientific proof for the Tribler stack, credibility, and good quality assurance infrastructure.
Open Science is a nice candidate to use as a dataset. For arXiv.org : Total number of submissions shown in graph as of June 15th, 2018 (after 26.9 years) = 1,403,151. Outdated stats for Amazon S3 download: The complete set of PDFs is about 270GB, source files about 190GB, and we make about 40GB of additions/updates each month (2012-02).
The overengeneering approach: "CORE’s aim is to aggregate Open Access content from across the world. We provide seamless access to over 30 million metadata records and 3 million full-text items accessible through a website, API and downloadable Data Sets." |
Gossip DB schemaWhen the gossip is deserialized from a file or a network message, it goes through several checks:
If the gossip passes these checks, it is processed and added to the local database. The database schema mimics the gossip format, but introduces some additional fields, like the addition moment timestamp and stemmed index. |
A bit of a memory dump; this is the old way of checking for duplicates:
The system worked like this, so that you could publish the same message twice. This is useful for messages which are not stored long term. That said, I don't think we need to support publishing the same message twice anymore and therefore your gossip design should work. |
The torrent for multistream (seekable) English Wikipedia articles archive is about 17GB of bz2-compressed data. It is assumed that the full unpacked size is about 80GB. Now come the updates... It is hard to estimate how many pages change daily on English Wikipedia, but the number of daily edits is about 200k, and the page churn is about 1k/day. So, assuming an average of 50 edits per page per day, about 20k pages change daily. That's 600k pages changed per month. Again, I did not find any info on the distribution of page retention time, but from those numbers, we can roughly estimate a monthly change rate of 10% of 6M articles. So, in the end, the whole English Wikipedia (text only) will take about 60GB on disk. It will take about 40 hours on a fast modern PC to upload it into the database in the current architecture. Wikipedia search results will be shown alongside other channels results. Updating Wikipedia will take about 1-2 hours per month on a fast PC. |
ToDo note for future. Seedbox activity. Limit to Max. 2-3 weeks of effort. As a university we want to showcase the strengths of our technology, while minimising engineering time. static dumps only ❗ Dynamic updates are too difficult.
|
Downloading of Arxiv's archive will costs us for about 100$ (based on information from https://github.com/mattbierbaum/arxiv-public-datasets#pdfs-aws-download-only) |
Seedbox activity is on pause now.
|
Channels are permissioned, that needs to change. Brainstorm target if tag experiment is successful: scientific journal without an editor-in-chief. Peer-review driven. Use markdown, picture and infinite amount of PDF attachment with crowd sourcing magic. The exact magic is our university goal: permissionless science. Leaderless science. Tag terms of both journal and individual PDF files become the -exclusive- terms for discoverability. Human created tag terms dramatically enhance scalability versus all keyword from abstract as search space for all scientific papers. Stepping stone for 'decentral Google' ;-) |
Permissionless crowdsourcing is incompatible with Channels 2.0 architecture that is based on torrents. Channels 3.0 architecture fits permissionless perfectly, though. |
great stuff 👏 Suggestions for next iteration:
|
The next prototype we designed: The number of snippets shown, and the number of torrents per snippets can be configured through global variables (default number of snippets shown: 1, default number of torrents per snippet: 4). We decided to keep the information in the snippet to a minimum. A torrent that is included in a snippet is not presented in another row as a "regular" torrent. |
Now that the GUI part of the snippet is mostly figured out, we had a brief discussion on the next step(s).
|
The upcoming sprint is a good idea to do a deep dive into the Linux query only. So do something that does not scale, manually tune and tweet just a single query and make dedicated regular expressions for it. Work until it is really perfect. Then everybody learned something from actual swarms, turning dirty data into perfect search. |
Sprint Update (Information Architecture for Tribler)PR #7070 contains the first version of the work described on both this ticket and #6214. This is also a key step towards #7064. Below are some of the insights/design decisions during this sprint. Last sprint we decided to focus on snippets that bundle torrents that describe the same content. This sprint focussed on building the storage backend. We decided to keep our design as simple as possible while not trying to limit ourselves too much for further iterations. We will in Tribler maintain a knowledge graph with content items, torrents and tags. See the visualisation below. We believe this structure is sufficient enough to improve search results by showing snippets, and for further iterations in which we can improve other aspects of Tribler. In the underlying database, this graph is stored as a collection of statements in the format We do not allow arbitrary predicates in the knowledge graph. To represent knowledge, we use the Dublin Core (DC) Metadata format and integrate it with the knowledge graph visualised above. This format gives us a set of attributes that are used to describe torrents, including year, data and format. The predicate field in each statement can take one of the 15 values given by DC. We also have two special predicate fields, namely TORRENT to describe content item - torrent relations, and TAG for backwards compatibility. In addition to the above, the tag generator has been updated to automatically generate tags for Ubuntu-related content. There are a few points we can focus on during the next sprint:
One of the design decisions is whether to add a FTS index to the strings in tuples. This would allow efficient searches in tuples (which I suspect will be a key aspect of querying the knowledge graph) but adds more complexity. We need an experiment to verify the speed-ups. |
Automatic Content Item generation (Ubuntu, Debian, Linux Mint): import re
from re import Pattern
from tribler.core.components.tag.rules.tag_rules_base import Rule, RulesList
space = r'[-\._\s]'
two_digit_version = r'(\d{1,2}(?:\.\d{1,2})?)'
def pattern(linux_distribution: str) -> Pattern:
return re.compile(f'{linux_distribution}{space}*{two_digit_version}', flags=re.IGNORECASE)
content_items_rules: RulesList = [
Rule(patterns=[pattern('ubuntu')],
actions=[lambda s: f'Ubuntu {s}']),
Rule(patterns=[pattern('debian')],
actions=[lambda s: f'Debian {s}']),
Rule(patterns=[re.compile(f'linux{space}*mint{space}*{two_digit_version}', flags=re.IGNORECASE)],
actions=[lambda s: f'Linux Mint {s}']),
]
|
Just a reminder to deeply understand what it means to be future proof
Lessons we never learned, but endured. 1) use mature libraries 2) REPEAT: avoid the latest cool immature tooling 3) never do a complete re-write 4) have an explicit roadmap with intermediaries goals 5) let the dev team make roadmap to |
As we are getting #7070 ready for merge, we made a few additional design decisions:
|
Sprint Update#7070 has been reviewed and merged, and is ready for deployment with our 7.13 release 🎉. 7.13 is scheduled to be released on Nov 1. Next steps:
|
Use cases.
We strive to provide the same kind of service that is provided by:
On fuzzy searches.
We do not try to become a distributed Google. Google is good at finding the knowledge that is relevant to words. It is used when the user does not know what he or she is looking for. Our user always knows what he or she wants. Therefore, we should disable metadata (or torrent contents) search by default, and only search by the torrent name. On the other hand, the user should be able to use metadata to refine the search results. Like, "sort by date of creation" or "sort by tag/channel", etc. This model already works perfectly for all use cases indicated above (except Youtube).
Data structurization.
We must be able to replicate the basic information organization structures that are used by our example use cases:
It is important to note that a single instance of the system is not required to serve all of these features at once. Instead, it should provide building blocks, or modules, to compose the use cases as necessary.
For an average user a single database could be used to serve every type of content, but, e.g., users interested in building a huge scientific database should be able to set up a separate database instance optimized for scientific articles.
Constraints on design choices
The design should consist of several completely independent parts:
It is very important that we do not make any of these parts depend on specific implementation of one another. We must be able to design the system in such a way, that it would be trivial to exchange one implementation of some part for another one.
User experience requirements:
Implementation requirements:
The text was updated successfully, but these errors were encountered: