Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support new v1 directory listings #709

Conversation

Jaifroid
Copy link
Member

@Jaifroid Jaifroid commented Mar 4, 2021

This is a draft of support for the new directory listings, corresponding to issue #708.

In line with a suggestion from @mgautierfr , I've implemented it in an extensible way: a developer can add further (future) listings in the ZIMArchive function of zimArchive.js by adding more known listings to the array of Listing objects. Obviously the only two listed there are currently as in code block below. This is very much just-in-case, as there are no plans to extend the listings.

Just note that, for testing with the ZIMs we have available, the code is currently rigged to fill out the articlePtrPos and articleCount metadata from X/listing/titleOrdered/v0 because we don't have a ZIM (at least not the ones I've tested) with a non-empty X/listing/titleOrdered/v1. The rigged code will be changed (a one-letter change) once we have a valid ZIM for testing. It is clearly commented where I've rigged it in zimArchive.js.

I've only done preliminary testing on two nons ZIMs (Ray Charles and Basketball), and on one legacy ZIM (Ray Charles, plus it is passing push tests with a split Ray Charles).

[{
    path: 'X/listing/titleOrdered/v0',
    ptrName: 'titlePtrPos',
    countName: 'entryCount'
},
{
    path: 'X/listing/titleOrdered/v1',
    ptrName: 'articlePtrPos',
    countName: 'articleCount'
}]

@Jaifroid Jaifroid added this to the v3.2 milestone Mar 4, 2021
@Jaifroid Jaifroid self-assigned this Mar 4, 2021
www/js/lib/zimArchive.js Outdated Show resolved Hide resolved
www/js/app.js Show resolved Hide resolved
www/js/lib/zimArchive.js Outdated Show resolved Hide resolved
@Jaifroid
Copy link
Member Author

Jaifroid commented Mar 7, 2021

@mossroy Just to let you know I've got as far as I can with this PR, until we have a test ZIM with a valid (non-empty) v1 listing. (see #708 (comment)). Bear in mind that any test of this PR using one of the test ZIMs in http://tmp.kiwix.org/nons_zims/ will currently act as if it's using a v1 listing, but in fact it is using a v0 listing, as confirmed by the diagnostic console.log entry that is written shortly after loading a new ZIM. The random button will occasionally show non-article assets with such a ZIM.

@Jaifroid Jaifroid force-pushed the Support-new-title-pointer-lists-by-path-X/listing/titleordered/v-x] branch from ab2f52d to 5bd90b1 Compare March 31, 2021 11:13
@Jaifroid
Copy link
Member Author

I've rebased this branch on master, have removed the rigged code, and have introduced a test for an empty title listing (which will be rejected gracefully), as suggested by mgautier. The code will fall back to the next best listing (which is usually the full entryPtrList).

@Jaifroid Jaifroid removed the do-not-merge Sample code label Mar 31, 2021
@Jaifroid
Copy link
Member Author

Quickly tested and working with the NEW wikipedia_fr_basketball_maxi_2019-08_nons.zim on Edge Chromium, IE11 and Firefox ESR (52.5.2). I haven't tested on FFOS.

I think this is ready for review.

@Jaifroid Jaifroid marked this pull request as ready for review March 31, 2021 11:41
Copy link
Contributor

@mossroy mossroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested on Firefox OS, with several old ZIM files and wikipedia_fr_basketball_maxi_2019-08_nons.zim
I did not see any issue, except the main page that can not be scrolled on my device (but I doubt it is a regression of this PR)
Congratulations!

I did not have a deep look in the code, but did not see any problem

www/js/lib/zimfile.js Outdated Show resolved Hide resolved
@Jaifroid
Copy link
Member Author

Jaifroid commented Apr 1, 2021

OK, thanks. I need to test on more of the supplied ZIMs.

The Basektball landing page has an issue common to many themed Wkipedia ZIMs in jQuery mode whereby only some of the tiles are loaded. We have had an open issue #552 for this (since August 2019). As you say this is unrelated to this PR. It is to do with the dependency on the 'masonry' JavaScript library to produce these tiled views. It's possible to work around it by preventing some of the CSS from loading, or else we could maybe inject the required scripts into jQuery mode, but that would be a separate support option for Wikipedia-based archives.

@mossroy
Copy link
Contributor

mossroy commented Apr 1, 2021

Forgot to add that, as soon as we have a "no-namespace" reference ZIM file (like an updated version of Ray Charles archive, that would also cover zstd compression?), we should add unit tests on it (while keeping at least a few unit tests on the previous format, to cover compatibility)

@Jaifroid
Copy link
Member Author

Jaifroid commented Apr 1, 2021

Agreed.

@Jaifroid
Copy link
Member Author

Jaifroid commented Apr 3, 2021

I have found one issue with reading the v1 title index of the new Ray Charles no-namespace ZIM. Opening this ZIM and searching for "r" only displays two title entries with this PR: "Ray Charles" and "Ray Charles discography". However, the landing page shows there should be 15 articles beginning with "r" (see screen shot below). I need to check whether this is a problem with this PR or a problem with the coding of this sample ZIM.

@mgautierfr Would you happen to know whether this ZIM only contains a sample subset of the articles in the ZIM?

image

www/js/lib/zimfile.js Outdated Show resolved Hide resolved
@Jaifroid
Copy link
Member Author

Jaifroid commented Apr 3, 2021

So I've found the issue that is preventing search from finding some of the title listings. The issue is that the title field of the Directory Entries contains html markup in some cases, e.g. <i>Title</i> (album) (see screenshot of sample dirEntry below, and notice the italic tags in the title field). Of course this means a) that these titles are sorted in the v1 listing under < instead of under the letter of the first semantic word in the title; and b) that if we search for a title using the first word of the title, it won't be found. The missing titles can in fact be searched for by typing < in the search field!

@mgautierfr I assume this is a format error in the sample Ray Charles ZIM, and that you are not intending to implement rich text formatting for directory titles?

image

Result of searching for < in search field:

image

@kelson42
Copy link
Collaborator

kelson42 commented Apr 3, 2021

@Jaifroid Looks like a bug in MWoffliner, similar to openzim/mwoffliner#605. Could you please open a ticket in openzim/mwoffliner and give a recent file where this problem occurs?

@Jaifroid
Copy link
Member Author

Jaifroid commented Apr 3, 2021

@kelson42 To be clear, this PR is for reading the NEW sample no-namespace ZIM archives with v1 title listings that @mgautierfr has uploaded to http://tmp.kiwix.org/nons_zims/ for us to test this PR on. I believe these have not been produced with mwoffliner, but with a conversion script. However, if you think they have been produced my mwoffliner I'd be happy to open a ticket. I am only seeing this with the sample Ray Charles ZIM provided (I haven't tested for this particular issue in others).

@kelson42
Copy link
Collaborator

kelson42 commented Apr 3, 2021

@Jaifroid I understand that these zim files have been recreated with zim tool "zimrecreate". But at the origin, there is a ZIM file created with MWoffliner and zimrecreate has for sure not "invented" the DOM node in the title. I believe you probably have the same bug already with the original zim file and this would not be a regression then. This should be verified.

@Jaifroid
Copy link
Member Author

Jaifroid commented Apr 3, 2021

@kelson42 I've just tested, and this problem is not in the latest Ray Charles (March 2021). I think the sample Ray Charles no-namespace ZIM has been made with an old Ray Charles that may well have had the old openzim/mwoffliner#605 issue (or similar). Since it's not in the current Ray Charles, then I think we can assume it's fixed.

@kelson42
Copy link
Collaborator

kelson42 commented Apr 3, 2021

@mgautierfr Would be nice if you could just confirm this (and redo the test ZIM file based on a recent version).

@mgautierfr
Copy link
Member

As the ray_charles name implies, it is based on a 2018-09 zim file, which is pretty old.
I will do another nons file with a recent ray_charles zim.

@mgautierfr
Copy link
Member

New http://tmp.kiwix.org/nons_zims/wikipedia_en_ray_charles_maxi_2021-03_nons.zim uploaded.

@Jaifroid
Copy link
Member Author

Jaifroid commented Apr 6, 2021

@mgautierfr Thank you! That was the missing piece in the puzzle and confirms that this PR is working correctly. There are no titles in the article pointer list beginning with an html tag, and a search for 'r' yields the expected articles. For information, a readout of the ZIM heade after reading the v1 title index is in the screenshot below.

image

@Jaifroid
Copy link
Member Author

Jaifroid commented Apr 7, 2021

In addition to previous tests, I've now tested on FFOS simulator and IE11 with the newest no-namespace Ray Charles ZIM. Everything seems to be working as expected, so I'll go ahead and merge now.

@mossroy I don't consider the latest Ray Charles to be a "reference version", because it was created with ZIMRecreate rather than with mwoffliner. Therefore I don't think we're ready yet to add tests specific to a new Ray Charles ZIM. As it may represent some work, it would be better to add tests on the first valid no-namespace Ray Charles produced by MWOffliner with updated libzim code.

@mossroy
Copy link
Contributor

mossroy commented Apr 7, 2021

All right, I agree

@Jaifroid Jaifroid merged commit 2d792b9 into master Apr 7, 2021
@Jaifroid Jaifroid deleted the Support-new-title-pointer-lists-by-path-X/listing/titleordered/v-x] branch April 7, 2021 13:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support version 1 of titlePtrList accessed from X/listing/titleOrdered/v1
4 participants