Support version 1 of titlePtrList accessed from X/listing/titleOrdered/v1 #708
Comments
@mgautierfr Is it safe for our code to make the following assumptions in order to find
If the above assumptions are not safe ones to make, how do you recommend we find these listings without using A final query is why you cannot undertake to keep a meaningful value in the |
Sorry for bombarding @mgautierfr, but do any of the ZIM archives in http://tmp.kiwix.org/nons_zims/ have these
I tested this by forcibly setting the search namespace to |
Yes.
No. We don't plan to add another namespace for now but it is something possible.
The first version of the zim files didn't have these v0 and v1 indices. |
Thanks for the reply @mgautierfr. I was indeed using those new versions (I checked the date), and I'm afraid I can only find the two Xapian indices in the X namespace (using On your other comments, thanks for the clarification that we should use the |
I will check.
I'm not sure I understand what you are suggesting. |
@mgautierfr OK, that makes sense -- I hadn't thought about that issue ( |
I have checked and I was wrong.
It would go in the opposite direction to the one I want to go in :)
There are very few use cases where we open a ZIM file without doing several binary searches. Just displaying a list of available ZIM files implies doing several binary searches to get the title/author/date/information/articlecount/favicon/... |
Thanks for updating the ZIM archives! I completely understand about not wanting to extend the header. My concern was only because binary search in JavaScript is much slower than in C++, though we've done a lot to optimize it. But it's perhaps more the overhead of writing the support for this feature... NB we don't do binary searches on unopened archives (not technically possible in JS in any case, except in rare cases where a user has picked a folder using the File System Access API, and it would be ridiculously slow/memory intensive). In any case, it should not be too hard now to code for this. Thanks for your help, @mgautierfr! |
@mgautierfr Just to let you know that in two of the |
Assumptions -> justifications:
|
All your assumptions are true. Everything is in the spec: https://wiki.openzim.org/wiki/Search_indexes
We may add another listing in the future (case insensitive ordered, media listing, whatever). |
Thank you @mgautierfr. We can easily have an array of listings fed to an accessor function that returns the start pointer and the number of entries of each listing. |
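A minimal sketch of such an accessor, in Python for consistency with the libzim snippet in this thread. The `Listing` record, its `offset` and `size` fields, and the sample values are hypothetical; the entry count assumes each title pointer is a 4-byte index, which should be checked against the spec.

```python
from dataclasses import dataclass

# Assumption: each title pointer is a 4-byte index into the URL pointer list.
TITLE_POINTER_SIZE = 4


@dataclass
class Listing:
    path: str    # path of the listing inside the X namespace
    offset: int  # hypothetical: absolute byte offset of the listing data
    size: int    # hypothetical: size of the listing data in bytes


def describe_listings(listings):
    """Return {path: (start pointer, number of entries)} for each listing."""
    return {
        item.path: (item.offset, item.size // TITLE_POINTER_SIZE)
        for item in listings
    }


# Illustrative values only, not taken from a real archive.
listings = [
    Listing("listing/titleOrdered/v0", offset=1000, size=40),
    Listing("listing/titleOrdered/v1", offset=4000, size=16),
]
result = describe_listings(listings)
```

Adding a future listing (case-insensitive ordered, media, and so on) would then only mean appending another record to the array.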
@mgautierfr I have basically finished coding support for ZIMs with v1 article-only title pointer lists in #709. Am I right that the test ZIMs you updated on 1st March have empty data in |
I've just uploaded a new version of the ZIM files. The previous one was generated with a version of zim-recreate without this change, so there were no items in the v1 title list. It should be good now. The v1 listing should not be zero-sized, but it is something you should be prepared for (at least discard the listing, fail gracefully, ...). |
@mgautierfr Thank you very much for the updated archives. I shall incorporate a test for a listing with no entries and fall back to the next best index in that case. |
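The fallback could look roughly like this. It is only a sketch: `pick_title_listing` and the shape of its input (a mapping from listing path to a start pointer and entry count) are assumptions made for illustration, not the Kiwix JS code.

```python
# Listings in order of preference: article-only v1 first, then the full v0.
PREFERENCE = ("listing/titleOrdered/v1", "listing/titleOrdered/v0")


def pick_title_listing(listings, preference=PREFERENCE):
    """Return (path, (start, count)) for the first non-empty preferred listing.

    `listings` maps a listing path to a (start pointer, entry count) tuple.
    Returns None when every candidate is missing or empty, so the caller
    can fail gracefully or use the header's title pointer list instead.
    """
    for path in preference:
        entry = listings.get(path)
        if entry is not None and entry[1] > 0:
            return path, entry
    return None


# A v1 listing that exists but has no entries is skipped in favour of v0.
chosen = pick_title_listing({
    "listing/titleOrdered/v1": (4000, 0),
    "listing/titleOrdered/v0": (1000, 10),
})
```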
Following on from #722, I've determined the issue (I think) that is causing the v1 listing to fail in the beer stackexchange ZIM. Although this ZIM contains a listing
Hence, there appears to be something wrong with the listing. @mgautierfr the relevant comment of yours about this is #708 (comment) . Do you know why this production ZIM appears to have a size of 0 for its v1 listing? |
@mossroy I've added diagnostics enabling you to open this |
@Jaifroid saving you some time ; those new sotoki Zims use a As it's a close to zero effort to not use |
Thanks, @rgaudin that's helpful for the style and image issues I notice. As far as I can tell, the landing page of this ZIM has But it gets worse: in the mode that does support dynamic content (we call it "Service Worker mode"), the landing page of this ZIM seems to perform a top-level navigation which overwrites our software, and leaves us with only the landing page loaded... Obviously no links then work at all, as our software is no longer present to intercept Fetches... |
Hum, no. Here's what I get from the ZIM, which corresponds to what to put inside:

import libzim.reader
zim = libzim.reader.Archive("/Users/reg/data/stackexchange/beer_stackexchange_com_2021-07.zim")
zim.main_entry
# Entry(url=mainPage, title=mainPage)
bytes(zim.main_entry.get_item().content).decode("UTF-8")[:200]
# '<!DOCTYPE html>\n<html class="html__responsive html__fixed-top-bar">\n <head>\n <meta charset="utf-8">\n <base href="">\n <title>Highest Voted Questions - Beer, Wine and Spirits St'

As you can see, the actual HTML code is As mentioned in my previous comment, I understand working with |
Thanks! I was showing you the HTML of the rendered page, rather than the input string. But reverting to the standard ZIM hyperlink conventions (i.e. relative to current URL, including when the URL is inside subdirectories which are common in Stackexchange) would be very helpful. |
It's possible that the In ServiceWorker mode, our UI disappears when opening this ZIM file. There must be some JavaScript code that detects that it is running inside our iframe and moves itself into the enclosing window (replacing our UI). It's a "protection" that some websites use |
OK, we do include some JS from both MathJax (doubt it uses this technique) and StackExchange (so we can use their highlighting feature). Both should be conditional and not enabled for this particular ZIM (we have tickets for that). |
@rgaudin The MathJax module works fine with Kiwix JS in current Stackexchange ZIMs, so it's definitely not that one causing the issue. I doubt highlighting would cause the break out of the iframe (but you never know!)... |
The thing is we don't use |
@mossroy @Jaifroid here are a couple new Zims without the base:
Let me know if you need anything else |
@rgaudin Thank you very much. The removal of the base tag definitely helps. However, there are two issues remaining:
|
Will look into the JS issue but the version one is probably at (py?)libzim level and for @mgautierfr. He's away this week though. |
Yes, it's definitely a libzim issue. There's discussion of handling this very error above, even though it shouldn't happen! |
If I can help in identifying which included script is causing the issue, let me know. |
I opened openzim/libzim#590. |
@Jaifroid sorry for the long delay. Here are a 2MB ZIM and a 15MB ZIM without the extraneous JS. All the JS included in this one is there on purpose. If there's still a problem, let me know. Still nothing in the title list though |
@Jaifroid this new zim should now have the title list filled as it is now using FRONT_ARTICLE Hints. |
@rgaudin I confirm that this latest ZIM now has 221 title entries in the v1 title index (out of 557 total dirEntries). I also confirm that the landing page no longer attempts to break out of the iframe, so this ZIM can be used perfectly in Kiwix JS 😊. Diagnostic screenshot below. |
@Jaifroid We can close this ticket then, I guess? |
@kelson42 Yes, I think our support for this is now complete. |
@rgaudin : I see some issues specifically with http://tmp.kiwix.org/alcohol_meta_stackexchange_com_2021-08.zim : I suppose they should be reported on https://github.com/openzim/sotoki ? |
that's right |
Having completed #698 I'm looking into how to support the new titlePtrList. I note that libzim has implemented a getTitleAccessor() function: https://github.com/openzim/libzim/blob/master/src/fileimpl.cpp#L158

There are two versions of the titlePtrList, one of which is X/listing/titleOrdered/v0, which is the traditional one that is also accessed via the titlePtrPos byte contained in the ZIM file header. The other is X/listing/titleOrdered/v1, which only contains article titles, and so this is the one we need to use for title search and random articles.

I'm having trouble getting my head around how we can "bootstrap" the app to the point where it can read a title from the X namespace (in order to retrieve these new titlePtrLists) without first accessing the traditional titlePtrPos that is in the archive header, but which is now considered obsolete, even though it points to the same offset represented by X/listing/titleOrdered/v0.

Can I use the urlPtrPos to binary search the urlPtrList instead? I can't find code corresponding to this in libzim (but the fact that I can't find it means nothing: I don't know C++ well enough).
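For illustration, the lookup being asked about could be sketched as a binary search over the URL-ordered entries, keyed on namespace then URL. This is not libzim's code: `find_by_url`, the `read_key` callback and the sample entries are hypothetical, and the sketch assumes the URL pointer list is sorted by (namespace, URL) with an ordering that matches Python's string comparison.

```python
def find_by_url(entry_count, read_key, namespace, url):
    """Binary search for (namespace, url); return its index or None.

    `read_key(i)` is a hypothetical callback returning the (namespace, url)
    of the i-th entry in URL order, e.g. by following the i-th URL pointer
    and reading the dirEntry it points to.
    """
    target = (namespace, url)
    lo, hi = 0, entry_count - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        key = read_key(mid)
        if key == target:
            return mid
        if key < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return None


# Illustrative in-memory stand-in for an archive's URL-ordered entries.
entries = sorted([
    ("A", "index.html"),
    ("M", "Title"),
    ("X", "fulltext/xapian"),
    ("X", "listing/titleOrdered/v0"),
    ("X", "listing/titleOrdered/v1"),
    ("X", "title/xapian"),
])
index = find_by_url(len(entries), lambda i: entries[i],
                    "X", "listing/titleOrdered/v1")
```

Each probe costs one dirEntry read, so finding a listing takes on the order of log2(entry count) reads and does not need titlePtrPos at all.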