-
Notifications
You must be signed in to change notification settings - Fork 452
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
The relevance ranking algorithm when searching in Tribler needs to be redesigned #2250
Comments
@whirm I need this for the new GUI and I almost have a proposal ready. Can you adjust the milestone and assign the issue to me? |
A first implementation of the algorithm as outlined above can be found here: https://gist.github.com/devos50/71b44339914259c82515f21846ee8173. Some minor tweaking might have to be done to make relevance ranking even better. I will not implement this for the old GUI but rather make the REST API use the new algorithm. |
nice! |
It might be worth considering that names aren't the only important thing a keyword search may be looking for - a 'deep search' of the content of files, too, used to be possible in earlier p2p networks (especially within special parts of files, like ID3 tags) and may yet be an important feature, long term to keep our eyes on. Another thing worth thinking about is whether something like 'pagerank' could be relevant - whereas we have categories/channels that contain torrents, if you search for, for example 'thrones' |
@themusicgod1 Yes, we need accurate rich metadata, proper pageranking/itemranking, and good ID3 stuff. btw. Worked on this myself 16 years ago, sadly never got solved (http://cd.ro.nu/hypermail/1046.html). Maybe an extra 16 years:-) |
Tribler Search Engine - channel expansionWe can finally start doing relevance ranking. The popularity community is producing results and channels should be getting real-time browsing after reset! 🎊 First priority is getting the above 5 heuristics implemented. Next heuristics, more invasive to QT: |
Yep. Users were even asking for something like that. |
During the last three weeks, I was finally able to work on the search quality improvements and have some interesting results. I can do a live demo tomorrow. First, I implemented a new torrent_rank(query, title, seeders, freshness) -> rank
The new function rates exact title match relatively high, but not the highest without the information about seeders and freshness: torrent_rank('Big Buck Bunny', 'Big Buck Bunny') -> 0.81 When the number of seeders is known, the rank is higher, even better if the torrent creation date is known: torrent_rank('Big Buck Bunny', 'Big Buck Bunny', seeders=100) -> 0.855
torrent_rank('Big Buck Bunny', 'Big Buck Bunny', seeders=100, freshness=100*DAYS) -> 0.8769230769230768 It is better for a torrent to have more seeders and to be recent: torrent_rank('Big Buck Bunny', 'Big Buck Bunny', seeders=100, freshness=100*DAYS) -> 0.8769230769230768
torrent_rank('Big Buck Bunny', 'Big Buck Bunny', seeders=200, freshness=100*DAYS) -> 0.8923076923076921
torrent_rank('Big Buck Bunny', 'Big Buck Bunny', seeders=300, freshness=100*DAYS) -> 0.8999999999999999
torrent_rank('Big Buck Bunny', 'Big Buck Bunny', seeders=300, freshness=30*DAYS) -> 0.9262499999999999
torrent_rank('Big Buck Bunny', 'Big Buck Bunny', seeders=300, freshness=10*DAYS) -> 0.9506249999999999 If a torrent has creation date in the future and its freshness is negative, the freshness is ignored: torrent_rank('Big Buck Bunny', 'Big Buck Bunny', freshness=10*DAYS) -> 0.8775
torrent_rank('Big Buck Bunny', 'Big Buck Bunny', freshness=-10*DAYS) -> 0.81 Now, let's concentrate on the title ranking. The exact match is good: torrent_rank('Big Buck Bunny', 'Big Buck Bunny') -> 0.81 If the title contains some excess words at the end, the rank is slightly worse: torrent_rank('Big Buck Bunny', 'Big Buck Bunny II') -> 0.8038167938931299 The more excess words we have in the title, the lower is the rank: torrent_rank('Big Buck Bunny', 'Big Buck Bunny Movie Creation History') -> 0.7917293233082708 If the title contains in-between words, the rank is lower. The closer extra words are to the start of the title, the bigger the penalty. The biggest penalty is when extra words appear at the start of the title: torrent_rank('Big Buck Bunny', 'Big Buck Bunny') -> 0.81
torrent_rank('Big Buck Bunny', 'Big Buck Bunny II') -> 0.8038167938931299
torrent_rank('Big Buck Bunny', 'Big Buck Brown Bunny') -> 0.7506109979633401
torrent_rank('Big Buck Bunny', 'Big Bad Buck Bunny') -> 0.7424206815511163
torrent_rank('Big Buck Bunny', 'Boring Big Buck Bunny') -> 0.7312500000000001 The more extra in-between words the title has, the lower the rank: click to expandtorrent_rank('Big Buck Bunny', 'Big Buck Bunny') -> 0.81
torrent_rank('Big Buck Bunny', 'Big Buck Bunny a') -> 0.8038167938931299
torrent_rank('Big Buck Bunny', 'Big Buck Bunny a b') -> 0.7977272727272727
torrent_rank('Big Buck Bunny', 'Big Buck Bunny a b c') -> 0.7917293233082708
torrent_rank('Big Buck Bunny', 'Big Buck a Bunny') -> 0.7506109979633401
torrent_rank('Big Buck Bunny', 'Big Buck a b Bunny') -> 0.6993358633776092
torrent_rank('Big Buck Bunny', 'Big Buck a b c Bunny') -> 0.6546181172291297
torrent_rank('Big Buck Bunny', 'Big a Buck Bunny') -> 0.7424206815511163
torrent_rank('Big Buck Bunny', 'Big a b Buck Bunny') -> 0.6852494577006508
torrent_rank('Big Buck Bunny', 'Big a b c Buck Bunny') -> 0.6362537764350453
torrent_rank('Big Buck Bunny', 'a Big Buck Bunny') -> 0.7312500000000001
torrent_rank('Big Buck Bunny', 'a b Big Buck Bunny') -> 0.6664556962025318
torrent_rank('Big Buck Bunny', 'a b c Big Buck Bunny') -> 0.6122093023255814 The wrong order of words in the title imposes a penalty: torrent_rank('Big Buck Bunny', 'Big Buck Bunny') -> 0.81
torrent_rank('Big Buck Bunny', 'Big Bunny Buck') -> 0.7476923076923077 The penalty is very high if some words are missed from the title. The closer the missed word to the beginning of the query, the bigger the penalty. Several missed words lead to a huge penalty: torrent_rank('Big Buck Bunny', 'Big Buck Bunny') -> 0.81
torrent_rank('Big Buck Bunny', 'Big Buck') -> 0.47250000000000003
torrent_rank('Big Buck Bunny', 'Big Bunny') -> 0.44181818181818183
torrent_rank('Big Buck Bunny', 'Buck Bunny') -> 0.405
torrent_rank('Big Buck Bunny', 'Bunny') -> 0.2858823529411765 If the query string or title string contains duplicates, the ranking function correctly takes them into account. If duplicates are in the query string, they should also be in the title, or the title receives a major penalty. If duplicates are in the title but not in the query, it gives a slight penalty to the result (and the order is also important): torrent_rank('Run Bunny Run', 'Run Bunny Run') -> 0.81
torrent_rank('Run Bunny', 'Run Bunny Run') -> 0.8033057851239669 # excess word in the title
torrent_rank('Run Bunny Run', 'Run Run Bunny') -> 0.7476923076923077 # wrong order
torrent_rank('Run Bunny Run', 'Run Bunny') -> 0.47250000000000003 # missed word in the title When title rank is not ideal, having many seeders and the recent torrent creation date is still useful. The torrent with a non-exact matching title can be higher in the search result list if it has a huge number of seeders. But if the title match is bad, having a huge number of seeders does not help: torrent_rank('Big Buck Bunny', 'Big Buck Bunny', seeders=5, freshness=365*DAYS) -> 0.8211573236889693
torrent_rank('Big Buck Bunny', 'Big Buck Bunny II', seeders=5, freshness=365*DAYS) -> 0.8148889471722596
torrent_rank('Big Buck Bunny', 'Big Buck Bunny II', seeders=1000, freshness=30*DAYS) -> 0.934177654406662
torrent_rank('Big Buck Bunny', 'Bunny', seeders=5, freshness=365*DAYS) -> 0.28982023189022443
torrent_rank('Big Buck Bunny', 'Bunny', seeders=1000, freshness=30*DAYS) -> 0.33224598930481275 Now some more examples that @synctext gives:
torrent_rank('Sintel', 'Sintel') -> 0.81
torrent_rank('Sintel', 'Sintel', seeders=100) -> 0.855
torrent_rank('Sintel', 'Sintel', seeders=1000) -> 0.8918181818181817
torrent_rank('Sintel', 'Sintel') -> 0.81
torrent_rank('Sintel', 'the.script.from.the.movie.Sintel.pdf') -> 0.5210526315789475
torrent_rank('Sintel', 'Sintel-Part-II') -> 0.7955357142857143
torrent_rank('Sintel', 'Part-of-Sintel') -> 0.6649253731343284
torrent_rank("The Internet's Own Boy: The Story of Aaron Swartz", "The Internet's Own Boy: The Story of Aaron Swartz") -> 0.81
torrent_rank("The Internet's Own Boy: The Story of Aaron Swartz", "The Internet's Boy: The Story of Aaron Swartz") -> 0.4984615384615385
torrent_rank("The Internet's Own Boy: The Story of Aaron Swartz", "The Story of Aaron Swartz") -> 0.1915720319099015
torrent_rank("Internet's Own Boy", "Internet's Own Boy") -> 0.81
torrent_rank("Internet's Own Boy", "Internet's Own Boy - Aaron Swartz") -> 0.7985915492957747
torrent_rank("Internet's Own Boy", "Internet's Very Own Boy") -> 0.7509933774834436
torrent_rank("Internet's Own Boy", "Internet's Very Special Boy Person") -> 0.43531669865642997 As the ranking function is integrated directly into the SQL query, it works fast. For me, the difference with the previous search algorithm looks very significant - now, the query to the local database returns precisely the result I expect it to give me. Remote queries will start receiving improved results after a version of Tribler with new ranking functions is officially released and installed by users. I hope to have a chance to demonstrate the proposed search algorithm tomorrow at the lab (with some UI improvements). |
Regarding the ideal search UI, I think it should satisfy the following guidelines:
I'll try to demonstrate the UI prototype that satisfies these guidelines. |
The screenshot demonstrates how the proposed search UI looks. It is very simple and does not contain any complicated UI elements. After a user types in a query string, it immediately sees local results (within seconds). If later remote results bring some new torrents, a search title starts displaying the number of remotely received torrents and updates this number dynamically. The user can click on the title of the search page to immediately add received torrents to the result list. In that case, new torrents are sorted according to the rank and inserted in the proper places. In the current draft implementation, these new torrents have the |
progress since 6 years! let not be afraid of a single refresh, users hate to click on things for them to be useful. Search: display local results and don't wait for remote search. Wait maximum 2000 ms for remote search to come in? Data driven:
|
Reminder, in the future we would like to do live searches. |
Solid progress, now we have dark SQL magic 🥳 We don't have any theoretically grounded algorithms to implement, first hire new phd student on this. ToDo document magic constructs like For 2022 and 2023 we want to improve search sophistication. We perceive 6 distinct levels (currently moving from level 1 to level 2). However, understanding what is going on in a distributed system composed of strangers with abusive spammers is known to be the hard problem. Reliable reproducing is essential.
|
@mg98 old 2016 issue, relevant for phd research. |
When searching in Tribler, search results (which can consist of torrents and channels) are ranked according to several criteria. Channels and torrents are sorted separately from each other. The associated channels of the torrents are also determined and in case a torrent is present in more than one channel, the channel with the highest number of votes is taken.
Channels are sorted based on the number of torrents in the channel. The algorithm to sort torrents is more complex. First, an array of five scores are constructed. These values are determined as follows:
After the torrents and channels are sorted, the results are put in a list. Every five torrent results, a channel result is inserted. This is to make sure that channels are also showing up in the search results (and more important, are visible to the user).
Also, I think it’s better to not be dependent on other search results to make streaming torrents from the database possible (like we do now when normalising). In other words, the final relevance score in the new algorithm should only be dependent on the data that this torrent object has.
I will start thinking about a new (more simple) algorithm that we could use for this and add the proposal in this issue.
Possible improvements:
Proposal for a new algorithm:
The new algorithm should take the matching with keywords better into account and be less dependent on other (confusing) variables such as number of seeders and number of torrents in the channel.
The above algorithm takes all of the improvements into account I outlined above.
The text was updated successfully, but these errors were encountered: