Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Random should be based off suggestions' pool #404

Closed
rgaudin opened this issue Aug 6, 2020 · 20 comments · Fixed by #446
Closed

Random should be based off suggestions' pool #404

rgaudin opened this issue Aug 6, 2020 · 20 comments · Fixed by #446

Comments

@rgaudin
Copy link
Member

rgaudin commented Aug 6, 2020

At the moment, the Random feature (Reader::getRandomPage()) is the result of a random article id inside the A namespace.

https://github.com/kiwix/kiwix-lib/blob/98a10d2ca1b404fca4386d624e6e7f5e342ea416/src/reader.cpp#L211-L232

in the libzim2 branch, @mgautierfr added a check to ensure the retrieved article has a text/html mime-type as the random article is taken from the whole articles list, since this branch dropped namespaces.

https://github.com/kiwix/kiwix-lib/blob/86fb0cc4a6a1eac7c72aca46e6560bbe686ce0dd/src/reader.cpp#L168-L187

We believe this is not the appropriate behavior as non-text/html content could be welcomed here.

There is no concept of “this should be included in random” but we do have a concept of “this should be navigable to” and that is the suggestion system which rely on get_title() being true which make some semantic sense: if you set a title on an article, you're exposing it.

The random feature should thus offer a random article that matches the suggestion criteria. Whether it's possible to reuse the suggestion methods or not is unknown to me though.

@kelson42
Copy link
Collaborator

kelson42 commented Aug 6, 2020

@veloman-yunkan Any opinion on this? Would you know how to make a random feature based on the title index?

@veloman-yunkan
Copy link
Collaborator

@kelson42 I will look into it

@veloman-yunkan
Copy link
Collaborator

@kelson42

Unfortunately, currently in the title index the articles without a title participate with their path (url) used instead of the title. As a result articles with real titles are not guaranteed to form a continuous range. Because of this an effective and effiecient approach for randomly picking up a titled article is not possible (without creating an additional index when the ZIM file is opened). The current algorithm can be trivially modified to only return articles with a title, however it may fail to do so within the set limit on the number of attempts (which is now equal to 42).

@mgautierfr Do you plan to change in libzim2 the way the title index is built (so that articles without titles are excluded from it)?

@kelson42
Copy link
Collaborator

kelson42 commented Aug 9, 2020

We are again impacted by openzim/libzim#208 here. We need a way to go through the title index without any side effect of the url index. @veloman-yunkan Only articles with titles given during the ZIM writting process will be returned and this is what we want. Would you be able to give a concrete example in which the random search in the title index does no give the expected result?

@kelson42
Copy link
Collaborator

kelson42 commented Aug 9, 2020

Actuallyimplementing should be trivial, just computes the number of entries base on the offset of urlpointerlst and titlepointerkst divided by 4. Then find a number in that range, read the 4 bytes entry.

@veloman-yunkan
Copy link
Collaborator

veloman-yunkan commented Aug 9, 2020

@kelson42

Only articles with titles given during the ZIM writting process will be returned and this is what we want.

No, because the title index in the ZIM file is sorted according to title-or-url-if-no-title. See:

https://github.com/openzim/libzim/blob/6.1.8/src/writer/_dirent.h#L81

https://github.com/openzim/libzim/blob/6.1.8/src/writer/_dirent.h#L170-L174

@kelson42
Copy link
Collaborator

kelson42 commented Aug 9, 2020

This is true but only only the entries which have a url=title are in the title index, so this does not matter here. All of this is just an optim to avoid to store the same string two times in the dirent.

@veloman-yunkan
Copy link
Collaborator

only the entries which have a url=title are in the title index

I doubt it. https://github.com/openzim/libzim/blob/6.1.8/src/fileimpl.cpp#L377-L386 suggests that the size of the tile index is equal to the full count of articles, i.e. all articles are in the title index.

@veloman-yunkan
Copy link
Collaborator

@kelson42
Copy link
Collaborator

kelson42 commented Aug 9, 2020

Ultimatively, the ZIM creator should decide what comes in the title index... but for the moment no image should be in any title index. It seems you are able to find them, so please can you open a bug with the details, because this is not normal.

@veloman-yunkan
Copy link
Collaborator

Shouldn't openzim/libzim#208 be reopened instead?

@kelson42
Copy link
Collaborator

kelson42 commented Aug 9, 2020

Not as long that this no proven to be a probleme here. Please give a concrete example of an entry which is in a title index and not appropriate for the randon feature please.

@veloman-yunkan
Copy link
Collaborator

@kelson42 Is this a proof that items without titles are treated as though they have titles (https://github.com/openzim/libzim/blob/6.1.8/test/find.cpp#L69-L77):

TEST(FindTests, ByTitle)
{
    zim::File file ("./test/wikibooks_be_all_nopic_2017-02.zim");

    auto article1 = file.findByTitle('-', "j/body.js"); // !!!!!!!!!!!!!!!!
    auto article2 = file.findByTitle('A', "index.html");
    ASSERT_EQ(article1.getIndex(), 1);
    ASSERT_EQ(article2->getIndex(), 7);
}

@kelson42
Copy link
Collaborator

kelson42 commented Aug 9, 2020

@veloman-yunkan "j/body.js" this should definitly never be made available via the title index. Is a bug.

@mgautierfr
Copy link
Member

"j/body.js" this should definitly never be made available via the title index. Is a bug.

It's not a bug, it's a feature :)

Related to openzim/libzim#397. (However I'm not sure that fixing it is the correct way to fix the random problem. I'm not saying it is the wrong way neither)

@kelson42
Copy link
Collaborator

We clearly depend on openzim/libzim#397 to implement this feature that way. Not sure anymore this is the best path to follow.

@stale
Copy link

stale bot commented Oct 10, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be now be reviewed manually. Thank you for your contributions.

@stale stale bot added the stale label Oct 10, 2020
@kelson42
Copy link
Collaborator

@rgaudin @mgautierfr We will need to make a decision/progress on this as it has to work for the Zimit project.

@kelson42
Copy link
Collaborator

With @mgautierfr We have decided to create a new "title list" which ultimatively (we will keep it for the moment for backward compatibility) will replace the current TitlePointerList. This new list will contain only the "front" articles (will have to be specified at ZIM creation time). This will done in openzim/libzim#397. Random will in the future be based on it.

If nobody has a strong concern, I propose to close this ticket?

@kelson42 kelson42 assigned mgautierfr and kelson42 and unassigned mgautierfr Dec 12, 2020
@rgaudin
Copy link
Member Author

rgaudin commented Dec 12, 2020

👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants