Support version 1 of titlePtrList accessed from X/listing/titleOrdered/v1 #708

Jaifroid · 2021-02-27T11:45:38Z

Having completed #698 I'm looking into how to support the new titePtrList. I note that libzim has implemented a getTitleAccessor() function:

https://github.com/openzim/libzim/blob/master/src/fileimpl.cpp#L158

There are two versions of the titlePtrList, one of which is X/listing/titleOrdered/v0, which is the traditional one that is also accessed via the titlePtrPos byte contained in the ZIM file header. The other is X/listing/titleOrdered/v1, which only contains article titles, and so this is the one we need to use for title search and random articles.

I'm having trouble getting my head around how we can "bootstrap" the app to the point where it can read a title from the X namespace (in order to retireve these new titlePtrLists without first accessing the traditional titlePtrPos that is in the archive header, but which is now considered obsolete, even though it points to the same offset represented by X/listing/titleOrdered/v0.

Can I use the urlPtrPos to binary search the urlPtrList instead? I can't find code corresponding to this in libZim (but the fact that I can't find it means nothing: I don't know C++ well enough).

The text was updated successfully, but these errors were encountered:

Jaifroid · 2021-02-28T10:01:42Z

@mgautierfr Is it safe for our code to make the following assumptions in order to find X/listing/titleOrdered/v0 and X/listing/titleOrdered/v1 while loading a new archive, without using the obsolete titlePtrPos contained in the archive header?:-

There will never be a namespace below X in a ZIM (in practical terms, assuming ASCII, we will not have a namespace Z)
Therefore the dirEntries for X/listing/titleOrdered/... will always be the last few dirEntries (currently the last three, if there is a Xapian index)
It would be safe to iterate through the last 10 (say) entries in urlPtrList to find the two entries for X/listing/titleOrdered/...

If the above assumptions are not safe ones to make, how do you recommend we find these listings without using titlePtrPos? Must we initiate a binary search through the urlPtrList?

A final query is why you cannot undertake to keep a meaningful value in the titlePtrPos byte of the ZIM archive header. It is marked as obsolete, but as I understand it, it currently just points to the start of X/listing/titleOrdered/v0 which is a very useful and quick way to find this list. Why can't this continue to be the case, without rendering the titlePtrPos byte obsolete? If a future ZIM file no longer has the v0 title index, then the header byte could simply point to the v1 index instead.

Jaifroid · 2021-02-28T18:31:13Z

Sorry for bombarding @mgautierfr, but do any of the ZIM archives in http://tmp.kiwix.org/nons_zims/ have these v0 and v1 indices? The only things I can find in the X namespace for the "nons" ZIMs I've tested are:

fulltext/xapian
title/xapian

I tested this by forcibly setting the search namespace to X in Kiwix JS Windows. This should only fail to show the listings if the archive's entryCount byte (in the archive header) is incorrectly set and does not include these listings in the count of entries. If that is the case, then how do I locate them?

mgautierfr · 2021-03-01T15:09:03Z

I'm having trouble getting my head around how we can "bootstrap" the app to the point where it can read a title from the X namespace (in order to retireve these new titlePtrLists without first accessing the traditional titlePtrPos that is in the archive header, but which is now considered obsolete, even though it points to the same offset represented by X/listing/titleOrdered/v0.

Must we initiate a binary search through the urlPtrList?

Yes.

X/listing/titleOrdered/v0 (and .../v1) are path/url. (They don't have any title)

You must setup your "code/structure/..." to have to use urlPtrPos to have get urlPtrList available.
Then you can "simply" search by path to find the title listing entry.
Setup the "title searching feature" with the found entry.
Then you are good.

There will never be a namespace below X in a ZIM (in practical terms, assuming ASCII, we will not have a namespace Z)

No. We don't plan to add another namespace for now but it is something possible.

but do any of the ZIM archives in http://tmp.kiwix.org/nons_zims have these v0 and v1 indices?

The first version of the zim files didn't have these v0 and v1 indices.
I've upload a new version "recently" (#597 (comment)) with the indices.
If not this is a issue on my side.

Jaifroid · 2021-03-01T15:26:17Z

I've upload a new version "recently" (#597 (comment)) with the indices.
If not this is a issue on my side.

Thanks for the reply @mgautierfr. I was indeed using those new versions (I checked the date), and I'm afraid I can only find the two Xapian indices in the X namespace (using urlPtrList), unless I'm doing something wrong. As I said, unless the total count of entries in the archive header is wrong (in which case the binary search function will not find any entries after these two), then these ZIMs do not appear to contain the new titlePtr lists.

On your other comments, thanks for the clarification that we should use the urlPtrList to find these entries, and that we should do that with a binary search on the list. (It does seem to add some complexity, and IMHO it would be more efficient to retain a meaningful entry for the titlePtrPos in the spec, but there's probalby a reason I'm not aware of as to why that can't be done).

mgautierfr · 2021-03-01T15:52:54Z

I was indeed using those new versions (I checked the date), and I'm afraid I can only find the two Xapian indices in the X namespace (using urlPtrList), unless I'm doing something wrong

I will check.

and IMHO it would be more efficient to retain a meaningful entry for the titlePtrPos in the spec, but there's probalby a reason I'm not aware of as to why that can't be done

I'm not sure to understand what you are suggesting.
titlePtrPos is still in the header.
We can't make this field points to the new titlelist (v1) as there is nothing to tell the reader how many entries are in the list. So this list MUST contains all dirents. (len(titlePtrList) == entryCount)
We can of course change the format. But then old implementation will not be able to read new zim. And introduce the fact that we use urlPtrList to locate the indexes (or wellknown entries) allow us to evolve easily
(we can add new entry and keep the old one, without breaking any implementation).

Jaifroid · 2021-03-01T16:01:06Z

@mgautierfr OK, that makes sense -- I hadn't thought about that issue (entryCount) -- so that's the reason I wasn't seeing!
A new entry in the header for the length of titlePtrLstV2 would be a quick way to access this list without having to do a binary search early on in the ZIM loading process. Your call! If we don't have it, I'll write the code for the binary search on X namespace (well, I can leverage existing binary search code).

mgautierfr · 2021-03-01T18:20:34Z

I will check.

I have check and I was wrong.
I've update new zim files with X/listing/titleOrdered/v[01].

A new entry in the header for the length of titlePtrLstV2 would be a quick way to access this list without having to do a binary search early on in the ZIM loading process. Your call! If we don't have it, I'll write the code for the binary search on X namespace (well, I can leverage existing binary search code).

It would go in the opposite direction I want to go :)
I would like to make the "minimal zim file" smaller as possible. title listing, title index, fulltext index, wellknown entries, new features ... would be optional and discoverable with the existence of the corresponding entry in the zim file.
It is not for tomorow (if it is one day at all), but I will not add new things in the header.

to access this list without having to do a binary search early on in the ZIM loading process.

There is very few use case where we open a zim file without doing several binary searches. Just displaying a list of available zim file imply to do several binary searches to get the title/author/date/information/articlecount/favicon/....
One extra binary search will not really slow the whole process.

Jaifroid · 2021-03-01T18:27:53Z

Thanks for updating the ZIM archives!

I completely understand about not wanting to extend the header. My concern was only because binary search in JavaScript is much slower than in C++,, though we've done a lot to optimize it. But it's perhaps more the overhead of writing the support for this feature... NB we don't do binary searches on unopened archives (not technically possibly in JS in any case, except in rare cases where a user has picked a folder using the File System Access API and would be ridiculously slow/memory intensive).

In any case, it should not be too hard now to code for this. Thanks for your help, @mgautierfr!

Jaifroid · 2021-03-02T08:11:48Z

Success in locating the new indices 😊 :

Jaifroid · 2021-03-02T18:40:50Z

@mgautierfr Just to let you know that in two of the nons ZIMs I've tested so far, Ray Charles and Basketball, the X/listing/titleOrdered/v1 exists but appears to be empty. X/listing/titleOrdered/v0, when accessed from the path, contains the expected data. I can work with this for now, as all I need to do is to set up our reader to use the latter, and then it will be easy to adapt it to use the former if it exists. But I thought I should let you know.

Jaifroid · 2021-03-02T23:04:51Z

Assumptions -> justifications:

A titleOrdered listing exists inside a single uncompressed cluster -> fileimp.cpp#L167 has comment "this is a ZIMFile format error" if the dirEntry's cluster is compressed;
A titleOrdered listing exists as a single BLOB contained in the dirEntry's cluster -> fileimpl.cpp#L172 gets the offset of a single BLOB
The titleOrdered listing's size needs to be calculated by subtracting the pointer of BLOB number n (containing the listing) from the pointer for BLOB n + 1 (which might be the next blob, or might be the end of the cluster's data) -> cluster.cpp#L128
The number of entries in a titleOrdered listing = listingSize / 4 (we need this so that we can look up titlePtrList entries in urlPtrList by index number)
For titleOrdered/v0 these calculations will always yield the same values as are already given in the ZIM archive header for titlePtrPos and entryCount (but in a future ZIM version they could hypothetically be removed)
For titleOrdered/v1 these calculations will yield different values, with a number of entries smaller than the archive header's entryCount (except, hypothetically, for a ZIM that only contains articles and no assets).

mgautierfr · 2021-03-03T15:42:08Z

All your assumption are true. Everything is in the spec : https://wiki.openzim.org/wiki/Search_indexes

titleOrdered listing (as xapian indexes) are items (articles) stored in cluster, as any other items.
titleOrdered has mimetype application/octet-stream+zimlisting which is not compressible.
So they are stored, in a uncompressed cluster, in a single blob. The size of the item is the offset(n+1)-offset(n) (as all other items)

titleOrdered/v0 exists only to have some kind of coherence in the spec while keeping compatibility.
We will have a titlePtrPos. It is not a huge lost to add a dirent/cluster around the data to homogenize how we locate the data.

We may add another listing in the future (case insensitive ordered, media listing, whatever).
We have no plan at all for now so no worry about this. But it may be a good idea to design your code to have some kind of optional listing if you can do it without extra cost.

Jaifroid · 2021-03-03T15:56:27Z

Thank you @mgautierfr. We can easily have an array of listings fed to an accessor function that returns the start pointer and the number of entries of each listing.

Jaifroid · 2021-03-07T17:00:21Z

@mgautierfr I have basically finished coding support for ZIMs with v1 article-only title pointer lists in #709. Am I right that the test ZIMs you updated on 1st March have empty data in X/listing/titleOrdered/v1? At least Ray Charles, Basketball and Wikispecies are all returning a pointer list size of 0 for the v1 listing (the v0 listing is correct).

mgautierfr · 2021-03-30T15:42:58Z

I've just upload a new version of zim files. Previous one was generated with a version of zim-recreate without this change, so no items was in the v1 title list.

It should be good now. v1 listing should not be zero sized (but it is something you should be prepared for (at least discard the listing, fail gracefully, ...))

Jaifroid · 2021-03-31T10:49:25Z

@mgautierfr Thank you very much for the updated archives. I shall incorporate a test for a listing with no entries and fall back to the next best index in that case.

Jaifroid · 2021-07-15T11:58:11Z

Following on from #722, I've determined the issue (I think) that is causing the v1 listing to fail in the beer stackexchange ZIM. Although this ZIM contains a listing X/listing/titleOrdered/v1, its metadata indicates that its size is 0 (see screenshot). At this point in the code I have a comment:

// Note that we do not accept a listing if its size is 0, i.e. if it contains no data
// (although this should not occur, we have been asked to handle it - see kiwix-js #708)

Hence, there appears to be something wrong with the listing. @mgautierfr the relevant comment of yours about this is #708 (comment) . Do you know why this production ZIM appears to have a size of 0 for its v1 listing?

Jaifroid · 2021-07-15T12:20:16Z

Adding some further diagnotic info: the Directory Entry for this listing, together with its metadata:

Jaifroid · 2021-07-15T12:29:47Z

@mossroy I've added diagnostics enabling you to open this beer_stackexchange ZIM and see above output in console log (include debug level) in this branch: https://github.com/kiwix/kiwix-js/tree/Diagnostics-for-no-ns-ZIMs . You'll see there are several other errors, including not-found images and stylesheets, but I'm focusing on getting the Listings correct before looking at any of those. It does look like a dodgy ZIM, though!

rgaudin · 2021-07-15T12:33:02Z

@Jaifroid saving you some time ; those new sotoki Zims use a <base /> node so every link and resources are relative to that base. So Should you have a non-base-aware check (like the zimcheck's regex-only one), you would find invalid links.

As it's a close to zero effort to not use <base /> versus an heavy one to use a DOM+ parser on zimcheck, we'll probably change that in the near future.

Jaifroid · 2021-07-15T13:32:08Z

Thanks, @rgaudin that's helpful for the style and image issues I notice. As far as I can tell, the landing page of this ZIM has <base href(unknown) /> which is odd (see screenshot). Could it be that you populate this dynamically? If so, that's a bad idea as we have to support reading a ZIM without running JavaScript inside the ZIM (in contexts where Service Worker is not supported, we cannot run dynamic content from the ZIM).

But it gets worse: in the mode that does support dynamic content (we call it "Service Worker mode"), the landing page of this ZIM seems to perform a top-level navigation which overwrites our software, and leaves us with only the landing page loaded... Obviously no links then work at all, as our software is no longer present to intercept Fetches...

rgaudin · 2021-07-15T17:18:00Z

Hum, no. here's what I get from the Zim, which corresponds to what to put inside:

import libzim.reader
zim = libzim.reader.Archive("/Users/reg/data/stackexchange/beer_stackexchange_com_2021-07.zim")
zim.main_entry
# Entry(url=mainPage, title=mainPage)
bytes(zim.main_entry.get_item().content).decode("UTF-8")[:200]
# '<!DOCTYPE html>\n<html class="html__responsive  html__fixed-top-bar">\n    <head>\n        <meta charset="utf-8">\n        <base href="">\n        <title>Highest Voted Questions - Beer, Wine and Spirits St'

As you can see, the actual HTML code is <base href="">.

As mentioned in my previous comment, I understand working with <base /> can be tricky for other tools so I'll update the scraper not to use it. I'll send you another Zim so you can focus on the other problem.

Jaifroid · 2021-07-15T17:26:20Z

Thanks! I was showing you the HTML of the rendered page, rather than the input string. But reverting to the standard ZIM hyperlink conventions (i.e. relative to current URL, including when the URL is inside subdirectories which are common in Stackexchange) would be very helpful.

mossroy · 2021-07-16T15:35:12Z

It's possible that the <base/> tag causes issues, at least in jQuery mode. If it's possible to avoid it, it would be better.

In ServiceWorker mode, our UI disappears when opening this ZIM file. There must be a javascript code that detects it is run inside our iframe, and moves itself in the enclosing window (replacing our UI). It's a "protection" that some websites use

rgaudin · 2021-07-16T15:47:47Z

OK, we do include some JS from both MathJax (doubt it uses this technique) and StackExchange (so we can use their highlighting feature). Both should be conditional and not enabled for this particular ZIM (we have tickets for that).
Will send you a base-less Zim soon and will look at removing the JS as well for beer.

Jaifroid · 2021-07-16T15:51:33Z

@rgaudin The MathJax module works fine with Kiwix JS in current Stackexchange ZIMs, so it's definitely not that one causing the issue. I doubt highlighting would cause the break out of the iframe (but you never know!)...

rgaudin · 2021-07-16T15:53:35Z

The thing is we don't use highlight.js directly but fire the whole Stack JS toolbox and tell it to enable highlighting so it's automatically configured properly. That JS code dynamically imports a bunch of other stuff.

rgaudin · 2021-07-21T18:27:17Z

@mossroy @Jaifroid here are a couple new Zims without the base:

Let me know if you need anything else

Jaifroid · 2021-07-22T08:27:22Z

@rgaudin Thank you very much. The removal of the base tag definitely helps. However, there are two issues remaining:

Both of these ZIMs have invalid metadata in the ZIM header: the header indicates the archive's minorVersion property is 1, and that they therefore should use the new v1 title pointer lists, but X/listing/titleOrdered/v1 has a size of 0 (or at least its size is declared as zero in the dirEntry metadata), and so it can't be used. The reader has to fall back to using the old v0 title pointer list. See screenshot below for diagnostic. I think this is an issue for @mgautierfr.
The landing page of the ZIM still breaks out of the iframe we enclose it in and overwrites itself at the top level, destroying our UI and making the ZIM unusable in our Service Worker mode. We have a legacy way of reading the archive which can read it because it disables JavaScript in the ZIM, but it would be much better if the offending script were removed from the ZIM, rather than us having to disable all JS in the ZIM.

rgaudin · 2021-07-22T08:30:18Z

Will look into the JS issue but the version one is probably at (py?)libzim level and for @mgautierfr. He'a away this week though.

Jaifroid · 2021-07-22T08:32:40Z

Yes, it's definitely a libzim issue. There's discussion of handling this very error above, even though it shouldn't happen!

Jaifroid · 2021-07-22T08:34:42Z

Will look into the JS issue but the version one is probably at (py?)libzim level and for @mgautierfr. He'a away this week though.

If I can help in identifying which included script is causing the issue, let me know.

kelson42 · 2021-07-22T09:29:47Z

@rgaudin @Jaifroid Please open ticket at libzim level.

Jaifroid · 2021-07-22T09:39:26Z

@rgaudin @Jaifroid Please open ticket at libzim level.

I opened openzim/libzim#590.

rgaudin · 2021-07-30T18:41:13Z

Will look into the JS issue

@Jaifroid sorry for the long delay. Here are a 2MB zim and a 15MB zim without the extraneous JS. All the JS included in this one are so on purpose. If there's still a problem, let me know.

Still nothing in the titlelist though

rgaudin · 2021-08-03T18:05:06Z

@Jaifroid this new zim should now have the title list filled as it is now using FRONT_ARTICLE Hints.

Jaifroid · 2021-08-05T04:34:44Z

@rgaudin I confirm that this latest ZIM now has 221 title entries in the v1 title index (out of 557 total dirEntries). I also confirm that the landing page no longer attempts to break out of the iframe, so this ZIM can be used perfectly in Kiwix JS 😊. Diagnostic screenshot below.

kelson42 · 2021-08-05T06:14:10Z

@Jaifroid We ca close this ticket then I guess?

Jaifroid · 2021-08-05T13:53:49Z

@kelson42 Yes, I think our support for this is now complete.

mossroy · 2021-08-20T12:49:45Z

@rgaudin : I see some issues specifically with http://tmp.kiwix.org/alcohol_meta_stackexchange_com_2021-08.zim : I suppose they should be reported on https://github.com/openzim/sotoki ?

rgaudin · 2021-08-20T13:12:26Z

that's right

Jaifroid added the enhancement label Feb 27, 2021

Jaifroid added this to the v3.2 milestone Feb 27, 2021

Jaifroid self-assigned this Feb 27, 2021

Jaifroid mentioned this issue Feb 27, 2021

Support reading ZIM archives with no namespace #597

Closed

Jaifroid mentioned this issue Mar 4, 2021

Support new v1 directory listings #709

Merged

Jaifroid linked a pull request Apr 7, 2021 that will close this issue

Support new v1 directory listings #709

Merged

Jaifroid closed this as completed in #709 Apr 7, 2021

Jaifroid added a commit that referenced this issue Apr 7, 2021

Support new v1 directory listings #708 (#709)

2d792b9

Jaifroid reopened this Jul 15, 2021

Jaifroid mentioned this issue Jul 15, 2021

Release 3.2.0 #722

Closed

13 tasks

Jaifroid mentioned this issue Jul 19, 2021

Correct usage of classList method #738

Merged

Jaifroid mentioned this issue Jul 22, 2021

ZIMs made with newest version of libzim have invalid X/listing/titleOrdered/v1 data openzim/libzim#590

Closed

Jaifroid added the backend label Jul 22, 2021

mgautierfr mentioned this issue Aug 4, 2021

Add support for Hints in python-libzim openzim/python-libzim#115

Merged

Jaifroid closed this as completed Aug 5, 2021

Jaifroid mentioned this issue Jul 29, 2023

Remove old (v0) titleindex (title pointer list) openzim/libzim#800

Open

Support version 1 of titlePtrList accessed from X/listing/titleOrdered/v1 #708

Support version 1 of titlePtrList accessed from X/listing/titleOrdered/v1 #708

Comments

Jaifroid commented Feb 27, 2021 • edited Loading

Jaifroid commented Feb 28, 2021 • edited Loading

Jaifroid commented Feb 28, 2021 • edited Loading

mgautierfr commented Mar 1, 2021

Jaifroid commented Mar 1, 2021

mgautierfr commented Mar 1, 2021

Jaifroid commented Mar 1, 2021

mgautierfr commented Mar 1, 2021

Jaifroid commented Mar 1, 2021

Jaifroid commented Mar 2, 2021

Jaifroid commented Mar 2, 2021

Jaifroid commented Mar 2, 2021 • edited Loading

mgautierfr commented Mar 3, 2021

Jaifroid commented Mar 3, 2021

Jaifroid commented Mar 7, 2021

mgautierfr commented Mar 30, 2021

Jaifroid commented Mar 31, 2021

Jaifroid commented Jul 15, 2021

Jaifroid commented Jul 15, 2021 • edited Loading

Jaifroid commented Jul 15, 2021

rgaudin commented Jul 15, 2021

Jaifroid commented Jul 15, 2021

rgaudin commented Jul 15, 2021

Jaifroid commented Jul 15, 2021

mossroy commented Jul 16, 2021

rgaudin commented Jul 16, 2021

Jaifroid commented Jul 16, 2021

rgaudin commented Jul 16, 2021

rgaudin commented Jul 21, 2021

Jaifroid commented Jul 22, 2021 • edited Loading

rgaudin commented Jul 22, 2021

Jaifroid commented Jul 22, 2021

Jaifroid commented Jul 22, 2021

kelson42 commented Jul 22, 2021

Jaifroid commented Jul 22, 2021

rgaudin commented Jul 30, 2021 • edited Loading

rgaudin commented Aug 3, 2021

Jaifroid commented Aug 5, 2021

kelson42 commented Aug 5, 2021

Jaifroid commented Aug 5, 2021

mossroy commented Aug 20, 2021

rgaudin commented Aug 20, 2021

Jaifroid commented Feb 27, 2021 •

edited

Loading

Jaifroid commented Feb 28, 2021 •

edited

Loading

Jaifroid commented Feb 28, 2021 •

edited

Loading

Jaifroid commented Mar 2, 2021 •

edited

Loading

Jaifroid commented Jul 15, 2021 •

edited

Loading

Jaifroid commented Jul 22, 2021 •

edited

Loading

rgaudin commented Jul 30, 2021 •

edited

Loading