-
-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Random should be based off suggestions' pool #404
Comments
@veloman-yunkan Any opinion on this? Would you know how to make a random feature based on the title index? |
@kelson42 I will look into it |
Unfortunately, currently in the title index the articles without a title participate with their path (url) used instead of the title. As a result articles with real titles are not guaranteed to form a continuous range. Because of this an effective and effiecient approach for randomly picking up a titled article is not possible (without creating an additional index when the ZIM file is opened). The current algorithm can be trivially modified to only return articles with a title, however it may fail to do so within the set limit on the number of attempts (which is now equal to 42). @mgautierfr Do you plan to change in libzim2 the way the title index is built (so that articles without titles are excluded from it)? |
|
Actuallyimplementing should be trivial, just computes the number of entries base on the offset of urlpointerlst and titlepointerkst divided by 4. Then find a number in that range, read the 4 bytes entry. |
No, because the title index in the ZIM file is sorted according to title-or-url-if-no-title. See: https://github.com/openzim/libzim/blob/6.1.8/src/writer/_dirent.h#L81 https://github.com/openzim/libzim/blob/6.1.8/src/writer/_dirent.h#L170-L174 |
This is true but only only the entries which have a url=title are in the title index, so this does not matter here. All of this is just an optim to avoid to store the same string two times in the dirent. |
I doubt it. https://github.com/openzim/libzim/blob/6.1.8/src/fileimpl.cpp#L377-L386 suggests that the size of the tile index is equal to the full count of articles, i.e. all articles are in the title index. |
Ultimatively, the ZIM creator should decide what comes in the title index... but for the moment no image should be in any title index. It seems you are able to find them, so please can you open a bug with the details, because this is not normal. |
Shouldn't openzim/libzim#208 be reopened instead? |
Not as long that this no proven to be a probleme here. Please give a concrete example of an entry which is in a title index and not appropriate for the randon feature please. |
@kelson42 Is this a proof that items without titles are treated as though they have titles (https://github.com/openzim/libzim/blob/6.1.8/test/find.cpp#L69-L77): TEST(FindTests, ByTitle)
{
zim::File file ("./test/wikibooks_be_all_nopic_2017-02.zim");
auto article1 = file.findByTitle('-', "j/body.js"); // !!!!!!!!!!!!!!!!
auto article2 = file.findByTitle('A', "index.html");
ASSERT_EQ(article1.getIndex(), 1);
ASSERT_EQ(article2->getIndex(), 7);
} |
@veloman-yunkan "j/body.js" this should definitly never be made available via the title index. Is a bug. |
It's not a bug, it's a feature :) Related to openzim/libzim#397. (However I'm not sure that fixing it is the correct way to fix the random problem. I'm not saying it is the wrong way neither) |
We clearly depend on openzim/libzim#397 to implement this feature that way. Not sure anymore this is the best path to follow. |
This issue has been automatically marked as stale because it has not had recent activity. It will be now be reviewed manually. Thank you for your contributions. |
@rgaudin @mgautierfr We will need to make a decision/progress on this as it has to work for the Zimit project. |
With @mgautierfr We have decided to create a new "title list" which ultimatively (we will keep it for the moment for backward compatibility) will replace the current TitlePointerList. This new list will contain only the "front" articles (will have to be specified at ZIM creation time). This will done in openzim/libzim#397. Random will in the future be based on it. If nobody has a strong concern, I propose to close this ticket? |
👍 |
At the moment, the Random feature (
Reader::getRandomPage()
) is the result of a random article id inside theA
namespace.https://github.com/kiwix/kiwix-lib/blob/98a10d2ca1b404fca4386d624e6e7f5e342ea416/src/reader.cpp#L211-L232
in the libzim2 branch, @mgautierfr added a check to ensure the retrieved article has a
text/html
mime-type as the random article is taken from the whole articles list, since this branch dropped namespaces.https://github.com/kiwix/kiwix-lib/blob/86fb0cc4a6a1eac7c72aca46e6560bbe686ce0dd/src/reader.cpp#L168-L187
We believe this is not the appropriate behavior as non-
text/html
content could be welcomed here.There is no concept of “this should be included in random” but we do have a concept of “this should be navigable to” and that is the suggestion system which rely on
get_title()
being true which make some semantic sense: if you set a title on an article, you're exposing it.The random feature should thus offer a random article that matches the suggestion criteria. Whether it's possible to reuse the suggestion methods or not is unknown to me though.
The text was updated successfully, but these errors were encountered: