Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support version 1 of titlePtrList accessed from X/listing/titleOrdered/v1 #708

Closed
Jaifroid opened this issue Feb 27, 2021 · 41 comments · Fixed by #709
Closed

Support version 1 of titlePtrList accessed from X/listing/titleOrdered/v1 #708

Jaifroid opened this issue Feb 27, 2021 · 41 comments · Fixed by #709
Assignees
Milestone

Comments

@Jaifroid
Copy link
Member

Jaifroid commented Feb 27, 2021

Having completed #698 I'm looking into how to support the new titePtrList. I note that libzim has implemented a getTitleAccessor() function:

https://github.com/openzim/libzim/blob/master/src/fileimpl.cpp#L158

There are two versions of the titlePtrList, one of which is X/listing/titleOrdered/v0, which is the traditional one that is also accessed via the titlePtrPos byte contained in the ZIM file header. The other is X/listing/titleOrdered/v1, which only contains article titles, and so this is the one we need to use for title search and random articles.

I'm having trouble getting my head around how we can "bootstrap" the app to the point where it can read a title from the X namespace (in order to retireve these new titlePtrLists without first accessing the traditional titlePtrPos that is in the archive header, but which is now considered obsolete, even though it points to the same offset represented by X/listing/titleOrdered/v0.

Can I use the urlPtrPos to binary search the urlPtrList instead? I can't find code corresponding to this in libZim (but the fact that I can't find it means nothing: I don't know C++ well enough).

@Jaifroid
Copy link
Member Author

Jaifroid commented Feb 28, 2021

@mgautierfr Is it safe for our code to make the following assumptions in order to find X/listing/titleOrdered/v0 and X/listing/titleOrdered/v1 while loading a new archive, without using the obsolete titlePtrPos contained in the archive header?:-

  • There will never be a namespace below X in a ZIM (in practical terms, assuming ASCII, we will not have a namespace Z)
  • Therefore the dirEntries for X/listing/titleOrdered/... will always be the last few dirEntries (currently the last three, if there is a Xapian index)
  • It would be safe to iterate through the last 10 (say) entries in urlPtrList to find the two entries for X/listing/titleOrdered/...

If the above assumptions are not safe ones to make, how do you recommend we find these listings without using titlePtrPos? Must we initiate a binary search through the urlPtrList?

A final query is why you cannot undertake to keep a meaningful value in the titlePtrPos byte of the ZIM archive header. It is marked as obsolete, but as I understand it, it currently just points to the start of X/listing/titleOrdered/v0 which is a very useful and quick way to find this list. Why can't this continue to be the case, without rendering the titlePtrPos byte obsolete? If a future ZIM file no longer has the v0 title index, then the header byte could simply point to the v1 index instead.

@Jaifroid
Copy link
Member Author

Jaifroid commented Feb 28, 2021

Sorry for bombarding @mgautierfr, but do any of the ZIM archives in http://tmp.kiwix.org/nons_zims/ have these v0 and v1 indices? The only things I can find in the X namespace for the "nons" ZIMs I've tested are:

  • fulltext/xapian
  • title/xapian

I tested this by forcibly setting the search namespace to X in Kiwix JS Windows. This should only fail to show the listings if the archive's entryCount byte (in the archive header) is incorrectly set and does not include these listings in the count of entries. If that is the case, then how do I locate them?

image

@mgautierfr
Copy link
Member

I'm having trouble getting my head around how we can "bootstrap" the app to the point where it can read a title from the X namespace (in order to retireve these new titlePtrLists without first accessing the traditional titlePtrPos that is in the archive header, but which is now considered obsolete, even though it points to the same offset represented by X/listing/titleOrdered/v0.

Must we initiate a binary search through the urlPtrList?

Yes.

X/listing/titleOrdered/v0 (and .../v1) are path/url. (They don't have any title)

  • You must setup your "code/structure/..." to have to use urlPtrPos to have get urlPtrList available.
  • Then you can "simply" search by path to find the title listing entry.
  • Setup the "title searching feature" with the found entry.
  • Then you are good.

There will never be a namespace below X in a ZIM (in practical terms, assuming ASCII, we will not have a namespace Z)

No. We don't plan to add another namespace for now but it is something possible.

but do any of the ZIM archives in http://tmp.kiwix.org/nons_zims have these v0 and v1 indices?

The first version of the zim files didn't have these v0 and v1 indices.
I've upload a new version "recently" (#597 (comment)) with the indices.
If not this is a issue on my side.

@Jaifroid
Copy link
Member Author

Jaifroid commented Mar 1, 2021

I've upload a new version "recently" (#597 (comment)) with the indices.
If not this is a issue on my side.

Thanks for the reply @mgautierfr. I was indeed using those new versions (I checked the date), and I'm afraid I can only find the two Xapian indices in the X namespace (using urlPtrList), unless I'm doing something wrong. As I said, unless the total count of entries in the archive header is wrong (in which case the binary search function will not find any entries after these two), then these ZIMs do not appear to contain the new titlePtr lists.

On your other comments, thanks for the clarification that we should use the urlPtrList to find these entries, and that we should do that with a binary search on the list. (It does seem to add some complexity, and IMHO it would be more efficient to retain a meaningful entry for the titlePtrPos in the spec, but there's probalby a reason I'm not aware of as to why that can't be done).

@mgautierfr
Copy link
Member

I was indeed using those new versions (I checked the date), and I'm afraid I can only find the two Xapian indices in the X namespace (using urlPtrList), unless I'm doing something wrong

I will check.

and IMHO it would be more efficient to retain a meaningful entry for the titlePtrPos in the spec, but there's probalby a reason I'm not aware of as to why that can't be done

I'm not sure to understand what you are suggesting.
titlePtrPos is still in the header.
We can't make this field points to the new titlelist (v1) as there is nothing to tell the reader how many entries are in the list. So this list MUST contains all dirents. (len(titlePtrList) == entryCount)
We can of course change the format. But then old implementation will not be able to read new zim. And introduce the fact that we use urlPtrList to locate the indexes (or wellknown entries) allow us to evolve easily
(we can add new entry and keep the old one, without breaking any implementation).

@Jaifroid
Copy link
Member Author

Jaifroid commented Mar 1, 2021

@mgautierfr OK, that makes sense -- I hadn't thought about that issue (entryCount) -- so that's the reason I wasn't seeing!
A new entry in the header for the length of titlePtrLstV2 would be a quick way to access this list without having to do a binary search early on in the ZIM loading process. Your call! If we don't have it, I'll write the code for the binary search on X namespace (well, I can leverage existing binary search code).

@mgautierfr
Copy link
Member

I will check.

I have check and I was wrong.
I've update new zim files with X/listing/titleOrdered/v[01].

A new entry in the header for the length of titlePtrLstV2 would be a quick way to access this list without having to do a binary search early on in the ZIM loading process. Your call! If we don't have it, I'll write the code for the binary search on X namespace (well, I can leverage existing binary search code).

It would go in the opposite direction I want to go :)
I would like to make the "minimal zim file" smaller as possible. title listing, title index, fulltext index, wellknown entries, new features ... would be optional and discoverable with the existence of the corresponding entry in the zim file.
It is not for tomorow (if it is one day at all), but I will not add new things in the header.

to access this list without having to do a binary search early on in the ZIM loading process.

There is very few use case where we open a zim file without doing several binary searches. Just displaying a list of available zim file imply to do several binary searches to get the title/author/date/information/articlecount/favicon/....
One extra binary search will not really slow the whole process.

@Jaifroid
Copy link
Member Author

Jaifroid commented Mar 1, 2021

Thanks for updating the ZIM archives!

I completely understand about not wanting to extend the header. My concern was only because binary search in JavaScript is much slower than in C++,, though we've done a lot to optimize it. But it's perhaps more the overhead of writing the support for this feature... NB we don't do binary searches on unopened archives (not technically possibly in JS in any case, except in rare cases where a user has picked a folder using the File System Access API and would be ridiculously slow/memory intensive).

In any case, it should not be too hard now to code for this. Thanks for your help, @mgautierfr!

@Jaifroid
Copy link
Member Author

Jaifroid commented Mar 2, 2021

Success in locating the new indices 😊 :

image

@Jaifroid
Copy link
Member Author

Jaifroid commented Mar 2, 2021

@mgautierfr Just to let you know that in two of the nons ZIMs I've tested so far, Ray Charles and Basketball, the X/listing/titleOrdered/v1 exists but appears to be empty. X/listing/titleOrdered/v0, when accessed from the path, contains the expected data. I can work with this for now, as all I need to do is to set up our reader to use the latter, and then it will be easy to adapt it to use the former if it exists. But I thought I should let you know.

@Jaifroid
Copy link
Member Author

Jaifroid commented Mar 2, 2021

Assumptions -> justifications:

  1. A titleOrdered listing exists inside a single uncompressed cluster -> fileimp.cpp#L167 has comment "this is a ZIMFile format error" if the dirEntry's cluster is compressed;
  2. A titleOrdered listing exists as a single BLOB contained in the dirEntry's cluster -> fileimpl.cpp#L172 gets the offset of a single BLOB
  3. The titleOrdered listing's size needs to be calculated by subtracting the pointer of BLOB number n (containing the listing) from the pointer for BLOB n + 1 (which might be the next blob, or might be the end of the cluster's data) -> cluster.cpp#L128
  4. The number of entries in a titleOrdered listing = listingSize / 4 (we need this so that we can look up titlePtrList entries in urlPtrList by index number)
  5. For titleOrdered/v0 these calculations will always yield the same values as are already given in the ZIM archive header for titlePtrPos and entryCount (but in a future ZIM version they could hypothetically be removed)
  6. For titleOrdered/v1 these calculations will yield different values, with a number of entries smaller than the archive header's entryCount (except, hypothetically, for a ZIM that only contains articles and no assets).

@mgautierfr
Copy link
Member

All your assumption are true. Everything is in the spec : https://wiki.openzim.org/wiki/Search_indexes

titleOrdered listing (as xapian indexes) are items (articles) stored in cluster, as any other items.
titleOrdered has mimetype application/octet-stream+zimlisting which is not compressible.
So they are stored, in a uncompressed cluster, in a single blob. The size of the item is the offset(n+1)-offset(n) (as all other items)

titleOrdered/v0 exists only to have some kind of coherence in the spec while keeping compatibility.
We will have a titlePtrPos. It is not a huge lost to add a dirent/cluster around the data to homogenize how we locate the data.


We may add another listing in the future (case insensitive ordered, media listing, whatever).
We have no plan at all for now so no worry about this. But it may be a good idea to design your code to have some kind of optional listing if you can do it without extra cost.

@Jaifroid
Copy link
Member Author

Jaifroid commented Mar 3, 2021

Thank you @mgautierfr. We can easily have an array of listings fed to an accessor function that returns the start pointer and the number of entries of each listing.

@Jaifroid
Copy link
Member Author

Jaifroid commented Mar 7, 2021

@mgautierfr I have basically finished coding support for ZIMs with v1 article-only title pointer lists in #709. Am I right that the test ZIMs you updated on 1st March have empty data in X/listing/titleOrdered/v1? At least Ray Charles, Basketball and Wikispecies are all returning a pointer list size of 0 for the v1 listing (the v0 listing is correct).

@mgautierfr
Copy link
Member

I've just upload a new version of zim files. Previous one was generated with a version of zim-recreate without this change, so no items was in the v1 title list.

It should be good now. v1 listing should not be zero sized (but it is something you should be prepared for (at least discard the listing, fail gracefully, ...))

@Jaifroid
Copy link
Member Author

@mgautierfr Thank you very much for the updated archives. I shall incorporate a test for a listing with no entries and fall back to the next best index in that case.

@Jaifroid Jaifroid linked a pull request Apr 7, 2021 that will close this issue
@Jaifroid Jaifroid reopened this Jul 15, 2021
@Jaifroid
Copy link
Member Author

Following on from #722, I've determined the issue (I think) that is causing the v1 listing to fail in the beer stackexchange ZIM. Although this ZIM contains a listing X/listing/titleOrdered/v1, its metadata indicates that its size is 0 (see screenshot). At this point in the code I have a comment:

// Note that we do not accept a listing if its size is 0, i.e. if it contains no data
// (although this should not occur, we have been asked to handle it - see kiwix-js #708)

Hence, there appears to be something wrong with the listing. @mgautierfr the relevant comment of yours about this is #708 (comment) . Do you know why this production ZIM appears to have a size of 0 for its v1 listing?

image

@Jaifroid Jaifroid mentioned this issue Jul 15, 2021
13 tasks
@Jaifroid
Copy link
Member Author

Jaifroid commented Jul 15, 2021

Adding some further diagnotic info: the Directory Entry for this listing, together with its metadata:

image

@Jaifroid
Copy link
Member Author

@mossroy I've added diagnostics enabling you to open this beer_stackexchange ZIM and see above output in console log (include debug level) in this branch: https://github.com/kiwix/kiwix-js/tree/Diagnostics-for-no-ns-ZIMs . You'll see there are several other errors, including not-found images and stylesheets, but I'm focusing on getting the Listings correct before looking at any of those. It does look like a dodgy ZIM, though!

@rgaudin
Copy link
Member

rgaudin commented Jul 15, 2021

@Jaifroid saving you some time ; those new sotoki Zims use a <base /> node so every link and resources are relative to that base. So Should you have a non-base-aware check (like the zimcheck's regex-only one), you would find invalid links.

As it's a close to zero effort to not use <base /> versus an heavy one to use a DOM+ parser on zimcheck, we'll probably change that in the near future.

@Jaifroid
Copy link
Member Author

Thanks, @rgaudin that's helpful for the style and image issues I notice. As far as I can tell, the landing page of this ZIM has <base href(unknown) /> which is odd (see screenshot). Could it be that you populate this dynamically? If so, that's a bad idea as we have to support reading a ZIM without running JavaScript inside the ZIM (in contexts where Service Worker is not supported, we cannot run dynamic content from the ZIM).

image

But it gets worse: in the mode that does support dynamic content (we call it "Service Worker mode"), the landing page of this ZIM seems to perform a top-level navigation which overwrites our software, and leaves us with only the landing page loaded... Obviously no links then work at all, as our software is no longer present to intercept Fetches...

@rgaudin
Copy link
Member

rgaudin commented Jul 15, 2021

Hum, no. here's what I get from the Zim, which corresponds to what to put inside:

import libzim.reader
zim = libzim.reader.Archive("/Users/reg/data/stackexchange/beer_stackexchange_com_2021-07.zim")
zim.main_entry
# Entry(url=mainPage, title=mainPage)
bytes(zim.main_entry.get_item().content).decode("UTF-8")[:200]
# '<!DOCTYPE html>\n<html class="html__responsive  html__fixed-top-bar">\n    <head>\n        <meta charset="utf-8">\n        <base href="">\n        <title>Highest Voted Questions - Beer, Wine and Spirits St'

As you can see, the actual HTML code is <base href="">.

As mentioned in my previous comment, I understand working with <base /> can be tricky for other tools so I'll update the scraper not to use it. I'll send you another Zim so you can focus on the other problem.

@Jaifroid
Copy link
Member Author

Thanks! I was showing you the HTML of the rendered page, rather than the input string. But reverting to the standard ZIM hyperlink conventions (i.e. relative to current URL, including when the URL is inside subdirectories which are common in Stackexchange) would be very helpful.

@mossroy
Copy link
Contributor

mossroy commented Jul 16, 2021

It's possible that the <base/> tag causes issues, at least in jQuery mode. If it's possible to avoid it, it would be better.

In ServiceWorker mode, our UI disappears when opening this ZIM file. There must be a javascript code that detects it is run inside our iframe, and moves itself in the enclosing window (replacing our UI). It's a "protection" that some websites use

@rgaudin
Copy link
Member

rgaudin commented Jul 16, 2021

OK, we do include some JS from both MathJax (doubt it uses this technique) and StackExchange (so we can use their highlighting feature). Both should be conditional and not enabled for this particular ZIM (we have tickets for that).
Will send you a base-less Zim soon and will look at removing the JS as well for beer.

@Jaifroid
Copy link
Member Author

@rgaudin The MathJax module works fine with Kiwix JS in current Stackexchange ZIMs, so it's definitely not that one causing the issue. I doubt highlighting would cause the break out of the iframe (but you never know!)...

@rgaudin
Copy link
Member

rgaudin commented Jul 16, 2021

The thing is we don't use highlight.js directly but fire the whole Stack JS toolbox and tell it to enable highlighting so it's automatically configured properly. That JS code dynamically imports a bunch of other stuff.

@rgaudin
Copy link
Member

rgaudin commented Jul 21, 2021

@mossroy @Jaifroid here are a couple new Zims without the base:

Let me know if you need anything else

@Jaifroid
Copy link
Member Author

Jaifroid commented Jul 22, 2021

@rgaudin Thank you very much. The removal of the base tag definitely helps. However, there are two issues remaining:

  • Both of these ZIMs have invalid metadata in the ZIM header: the header indicates the archive's minorVersion property is 1, and that they therefore should use the new v1 title pointer lists, but X/listing/titleOrdered/v1 has a size of 0 (or at least its size is declared as zero in the dirEntry metadata), and so it can't be used. The reader has to fall back to using the old v0 title pointer list. See screenshot below for diagnostic. I think this is an issue for @mgautierfr.
  • The landing page of the ZIM still breaks out of the iframe we enclose it in and overwrites itself at the top level, destroying our UI and making the ZIM unusable in our Service Worker mode. We have a legacy way of reading the archive which can read it because it disables JavaScript in the ZIM, but it would be much better if the offending script were removed from the ZIM, rather than us having to disable all JS in the ZIM.

image

@rgaudin
Copy link
Member

rgaudin commented Jul 22, 2021

Will look into the JS issue but the version one is probably at (py?)libzim level and for @mgautierfr. He'a away this week though.

@Jaifroid
Copy link
Member Author

Yes, it's definitely a libzim issue. There's discussion of handling this very error above, even though it shouldn't happen!

@Jaifroid
Copy link
Member Author

Will look into the JS issue but the version one is probably at (py?)libzim level and for @mgautierfr. He'a away this week though.

If I can help in identifying which included script is causing the issue, let me know.

@kelson42
Copy link
Collaborator

@rgaudin @Jaifroid Please open ticket at libzim level.

@Jaifroid
Copy link
Member Author

@rgaudin @Jaifroid Please open ticket at libzim level.

I opened openzim/libzim#590.

@rgaudin
Copy link
Member

rgaudin commented Jul 30, 2021

Will look into the JS issue

@Jaifroid sorry for the long delay. Here are a 2MB zim and a 15MB zim without the extraneous JS. All the JS included in this one are so on purpose. If there's still a problem, let me know.

Still nothing in the titlelist though

@rgaudin
Copy link
Member

rgaudin commented Aug 3, 2021

@Jaifroid this new zim should now have the title list filled as it is now using FRONT_ARTICLE Hints.

@Jaifroid
Copy link
Member Author

Jaifroid commented Aug 5, 2021

@rgaudin I confirm that this latest ZIM now has 221 title entries in the v1 title index (out of 557 total dirEntries). I also confirm that the landing page no longer attempts to break out of the iframe, so this ZIM can be used perfectly in Kiwix JS 😊. Diagnostic screenshot below.

image

@kelson42
Copy link
Collaborator

kelson42 commented Aug 5, 2021

@Jaifroid We ca close this ticket then I guess?

@Jaifroid
Copy link
Member Author

Jaifroid commented Aug 5, 2021

@kelson42 Yes, I think our support for this is now complete.

@Jaifroid Jaifroid closed this as completed Aug 5, 2021
@mossroy
Copy link
Contributor

mossroy commented Aug 20, 2021

@rgaudin : I see some issues specifically with http://tmp.kiwix.org/alcohol_meta_stackexchange_com_2021-08.zim : I suppose they should be reported on https://github.com/openzim/sotoki ?

@rgaudin
Copy link
Member

rgaudin commented Aug 20, 2021

that's right

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants