zimit v2. [libzim/libkiwix/warc2zim part] #95
Comments
As @rgaudin mentioned in kiwix/kiwix-android#3485 (comment) (I had totally missed the behavior explained there), we have to properly handle external links.
Static rewriting
We are (will be) rewriting all links ( The only way I see to avoid that (if we want to avoid it) is to parse the warc a first time to know all the entries, and then do the (classic) handling of content but rewrite only links to existing entries (and keep other links as external links).
Dynamic rewrite
We do the same as for static rewriting. But there, we are in the browser, and checking whether the entry exists before we rewrite the link means at least one request.
Server handling
If we have a request for a non-existent entry and if path is
Questions :
In fact, if we accept only links navigating to other websites and we assume they can only be in html pages, we can simply do the "complex" static rewrite and we are good. |
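The two-pass idea above (first collect every entry from the warc, then rewrite only links that point to an existing entry) could be sketched as follows. This is a minimal illustration, not warc2zim code: `entry_key`, `classify_links` and the example URLs are hypothetical, and the entry key used here (host plus path plus query) is only an assumption about how entries might be named.

```python
from urllib.parse import urljoin, urlsplit


def entry_key(url):
    """Hypothetical entry key: host + path (+ query string)."""
    parts = urlsplit(url)
    key = parts.netloc + parts.path
    return key + ("?" + parts.query if parts.query else "")


def classify_links(page_url, hrefs, known_entries):
    """Mark each href as internal (target is a known entry) or external."""
    result = {}
    for href in hrefs:
        absolute = urljoin(page_url, href)
        key = entry_key(absolute)
        if key in known_entries:
            result[href] = ("internal", key)       # rewrite to the entry
        else:
            result[href] = ("external", absolute)  # keep as an external link
    return result


# First pass (assumed): every URL that has a record in the warc.
known = {
    entry_key(u)
    for u in (
        "https://example.org/",
        "https://example.org/about",
        "https://cdn.example.net/app.js",
    )
}

# Second pass: decide what to do with each link found in a page.
links = classify_links(
    "https://example.org/",
    ["about", "//cdn.example.net/app.js", "https://other.example/page"],
    known,
)
```

The cost of this approach is the extra pass over the warc; the benefit is that the decision is made once, at creation time, with no request needed in the reader.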
@mgautierfr Thank you for documenting your thinking so carefully. I've belatedly read through it. There's a lot here, but three things stood out for me:
1. Common URL schema
You propose:
From the work I've done with warc2zim, I'm really not sure this is a valid distinction. I have noticed that some ZIMs contain valid resources from a wide range of sites. And if you think about it, this is necessary given that a page may be grabbing its JS from a CDN, or images from another domain owned by the company, and especially for video, which is almost always from a different domain but is often embedded in a page, and may be first party (or may be YouTube). It could get very difficult to decide what is first-party and what is third-party, and I think having a rigid distinction like that could break some sites.
An example: a recent Mozilla Developer Network scrape contains not only pages from MDN, but also several older MDN pages from archive.org that are linked to, scraped, and displayed offline in the ZIM inside an archive.org frame! Now, that may be a mistake by the person who launched that scrape, but in other cases it won't be a mistake. I'm not sure the distinction holds. It might be better to design a more flexible format upfront that allows arbitrary numbers of domains to be stored. Currently this is actually quite logical. The domain name is included in the ZIM URL, like
2. Usefulness of Headers (pseudo H namespace)
My custom implementation in the PWA is designed mostly to make largely static resources readable (though it can rewrite most links in CSS and JS scripts, just not those that are constructed highly dynamically at run-time unless I'm lucky). Although I mostly ignore the headers, I found that sometimes they are needed. The main use case was to find a redirected resource. Sometimes that information is in the initial response body, but sometimes the server has only sent a redirect header, and there is no Response body. So, I have a recursive lookup: if a requested resource is not found at
Now, while redirect may be the main use case, there are several other reasons to use the headers in more dynamic situations.
The Service Worker has the logic that deals with this. I found 18 references to a function
My conclusion about Headers
I found the seven broad categories above where Response Headers are needed (and there is some code for Request Headers too). So, ISTM that to deal with the huge variety of situations in which we may have things such as range requests (especially for streaming data), or AJAX or Fetch requests, and the fact that WARC can intercept these and record the responses, it would be risky to ditch the capacity for storing and using the Headers.
3. Video BLOBs or streams of requests and responses?
You ask above whether video is stored (effectively) as BLOBs or as streams (chunks). I think the point of the WARC format is that it could be either. I don't think the fact that the Android app reads BLOBs from the ZIM in a normal (non-WARC ZIM) is relevant. If the Service Worker is doing its job correctly, it will bypass that. All the Service Worker is doing is effectively intercepting requests and providing responses (yes, it has a lot of logic to do transformations, but basically it is just doing what all Service Workers do: there is an event listener on the Fetch event, and the SW does
So, my experience is that in MOST cases of YouTube videos (the ones I have implemented in the PWA), there is an identifiable MP4 BLOB (after fuzzy URL transformation / reduction). But of course YouTube COULD simply stream video chunks, and have some complex JS reader that recombines them only when the right authentication response has been sent to the server. The WARC format doesn't care about this. It will merely record the authentication response sent to the server and the encoded chunks received, and the piece of JS that recombines the chunks will be happy. And, I think, Kiwix Android will also be happy because it's not reading the video in the way it would read video from a Wikimedia ZIM file.
The webview is just making a request, and the response is elicited from the ZIM by the Service Worker's transformation functions, and these are sent back to the WebView, which has a JS player, and all is good (maybe!). In any case, I don't think it's safe to assume we'll always have a BLOB to play rather than a stream. We need to design Zimit 2.0 in a way that is flexible and future-proof, which means that multimedia content is also just a set of requests and responses. |
1. Common URL schema
I think you misunderstood the url schema.
We can still store any content from any website, without any limitation. It is just that one domain is elided from the entry path, and we know this is the "main" domain of the scraped website. The main purpose is to avoid having the domain visible in the url from a user's point of view.
2. Usefulness of Headers (pseudo H namespace)
1. REDIRECTS:
We already have a mechanism for redirection. We should use it. If a warc record contains a redirection response, we must create a redirect entry. No need for headers for that.
2. MIME TYPES / transfer encoding
For mime types, as for redirects, we can already store them in the zim.
3. COOKIES
That's an interesting point. It happens that cookies are the next thing I have to make work, so I will see :)
4. COMPENSATING FOR SW RUNNING IN AN EXTENSION (could be important for Kiwix JS!)
You should be able to get this information (redirect) from a classic zim file, as we will store classic redirect entries (or aliases, which will lead to even less work on your side).
5. FORMS and UPLOADS
I wonder why you need the lengthHeader. By definition, the server doesn't handle POST requests, so it is somewhat useless to send data to the server. (And in warc2zim, we move all data of a POST request into the entry path querystring
This is the same as a redirect.
6. FETCH AND RANGE REQUESTS
Indeed, this is something we have to handle. But we can move this information into the path, as we do for POST data.
7. AJAX REQUESTS
What do you do if it is not response ? Rewrite the content ?
We never had the capacity to store and use the headers :) So we ditch nothing :)
3. Video BLOBs or streams of requests and responses?
Well, the purpose of zimit v2 is to not have a Service Worker. So no one can do its job correctly (or not).
If I understand the android behavior correctly, the purpose is to not use the js player (or the web player) but to use the "native player". This allows the video to be played directly by android native code, bypassing all the app/webview/server/libzim code. But to do this, we need contiguous data.
I agree, but it has an impact on readers that rely on this assumption. (And it is a valid one, as we didn't have a way to store different ranges of data in different entries, so we always had one entry per content.) BTW, here is a small teaser of a zim created with the dev version of warc2zim. It works without a service worker and should work without fuzzy matching or any fancy stuff. (Not working, at least: cookies, external link handling) |
I'd also prefer a single way to store entries, for the sake of not having to handle two. Maybe this was chosen to have better-looking paths for the main domain. @mgautierfr what's the reason for the two entries format?
Thank you for laying them all out. It's really useful. We've discussed a couple of them as theoretical possibilities but haven't encountered them in reality. It all looks like it can be gradually introduced back. We should probably set up a bunch of websites that trigger and use some of those use cases so we can have automated tests.
I've said the same thing a few times but lacked an actual use case to back it up. It's very frequent on my own laptop to see non-blobs being transferred, and there are multiple competing stream technologies. Each needs to be implemented, though. |
It is still allowed with the schema proposed (and implemented for now).
Just have urls which look like we used to. Storing the host in the entry path ( http://public.kymeria.fr/KIWIX/zimit2/kiwix_no_main_domain.zim is the same zim without url simplified. |
Yep, I saw your comment just after publishing mine. |
Thanks for the explanations and reassurances, @mgautierfr. I hope at least that the research on the use cases of headers was useful. I hadn't understood the logic behind the URL proposal -- I see now that it's just a form of abbreviation, and in fact it works just as well without the abbreviation, so it's optional. Presumably the main use case for abbreviated URLs is in browsers accessing a ZIM via Kiwix Serve, because I don't think in any other context users are particularly aware of URLs (and in many contexts, they can't see them at all). The main reason for POST requests would be to record visits to sites where a POST is used to get a resource without it being in the URL (as POSTing without relying on querystrings is considered more secure). But I imagine this is a bit unlikely for a ZIM, except for google video, which you've already implemented via a separate process. Congratulations 🎉 on those ZIM samples. I've tested both in Kiwix JS and in Kiwix PWA, and (apart from a small issue with some hyperlinks having a /C/ in them, which comes from differences in our backend and should be easy to fix in our Service Worker) they are working very well: all JS, CSS, etc. is loading correctly on the landing page, and most hyperlinks work fine. That's certainly remarkable! |
I jump on this very long discussion. I hope I get it right and make a useful comment. That said, I would really prefer to have one ticket per fundamental change. That said:
|
Good point!
Attention, this comparison is too simple: we only do this in select scrapers (sotoki, mwoffliner and maybe wikihow) for which we know we're working off a tiny list of basic nodes. This can't be compared with zimit, where the possibilities are all those offered by HTML and JS. That's why we rely (or will be relying) on Wombat.js. Not sure what you meant by “static rewriting”, but if the goal is the same, the implementation is going to be different (and more complex): an external link has no other property than “not being in the ZIM”. Since Wombat runs in the client (to intercept calls), the client side must be able to find out whether an entry is in the ZIM or not. |
I agree. Wombat is too complex and sensitive (at least given my knowledge) to play with too much. I have made the eliding optional and I'm testing without it.
We (will) do static rewriting. In the example zim files, all html (almost, not html for ajax requests) content is statically rewritten. But we need to dynamically rewrite url (coming from js request) and content (response of ajax request) |
@mgautierfr Having finally managed to integrate the Replay system (with Service Worker) in kiwix/kiwix-js#1173, I have a better understanding of the importance of headers. You wrote above:
I've come to realize (belatedly) that while the headers are not (generally) important for looking up assets from the backend / server, they are potentially important instructions to the user's browser about how to deal with those assets. I apologize in advance if that's really obvious to everyone else, but I think in previous discussions we (or at least I) were focusing on how they might help us look up assets directly (the simple revisits you mention), rather than the fact they tell the browser how to deal with retrieved payloads.
Overview
Currently a client accessing a Zimit article via Kiwix Serve will:
Code
The high-level code that does this in the Replay Service Worker is below. I've added some comments to make it quicker to parse (for a human), but the comment about a bug in Kiwix serve, and the Obviously, there's a lot more going on behind this top-level code, but for me it gives the clearest picture of what is happening, and therefore how / if to emulate use of headers in Zimit 2.0. Again, sorry if this is stating the obvious, but it seems useful to document it, even if only for the benefit of others finding this:
|
@mgautierfr A potentially interesting observation from my work on enabling Replay support in Kiwix JS-family apps. NB, this is not a recommendation to change approach, just an observation that might be useful as a fallback. So, I'm just putting it out there. Feel free to shoot this down! I think your approach is ultimately more universal, as it creates a standard ZIM that existing readers should be able to use without changes to their backends.
I realized belatedly that Now I wondered if the webview used by Kiwix Desktop can run Web Workers. They're pretty old technology. Even IE11 can run a Web Worker (though obviously not this one, because it uses very advanced JS, lots of async, etc.). And if that's the case, can the Webview catch all Fetch requests within its scope (like what a Service Worker does)? Again, if that's the case, ISTM that there might be a "simple" solution for supporting current Zimit ZIMs. (Well, nothing is ever "simple"...) Of course, this would also require work in the Kiwix Desktop backend, and probably, like in Kiwix JS, would require hosting your own custom copy of
Caveats:
|
Closing in favour of openzim/zimit#193. See also https://github.com/orgs/openzim/projects/10 |
This ticket lists what needs to be done to take the PR openzim/warc2zim#113 from a POC to a real feature.
Specification
Improvement of the current specification to support warc2zim requirements.
Following openzim/warc2zim#113, we need to evolve the current kiwix/zim format.
The zim file format itself (the binary way of storing content) will not evolve; we will evolve the "kiwix" format (what we store and how we interpret it).
While it is the "kiwix" format that evolves, this is still a low-level change, and we may need to change libzim itself (both at reading and creation time) to support this new format.
The main change is :
[ ] To store fuzzy matching rules
Aliasing
WARC files contain revisits: entries which need to be served with the content of another one.
The current POC uses the H namespace to store redirects that need to be handled as aliases. We can do the same, or we can do as hard links are done:
Two (or more) entries are content entries and point to the same content (blob/cluster id or redirect id).
Using "hard links" would require adapting the libzim creator side, but no change at all would be needed in the specification or the reading part.
(However, zim-check would need to be adapted, as it would find duplicate content.)
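To make the "hard link" option concrete, here is a toy in-memory model, assuming nothing about the real libzim creator API. `ToyArchive` and all its methods are hypothetical names used only for illustration: an alias is simply a second entry that stores the same blob id as its target, so reading needs no special handling, while a duplicate check sees several paths for one blob.

```python
from dataclasses import dataclass, field


@dataclass
class ToyArchive:
    """Toy model only; this is NOT the libzim creator API."""

    blobs: list = field(default_factory=list)    # blob id -> bytes
    entries: dict = field(default_factory=dict)  # entry path -> blob id

    def add_content(self, path, data):
        self.blobs.append(data)
        self.entries[path] = len(self.blobs) - 1

    def add_alias(self, path, target_path):
        # Hard link: the new entry reuses the blob id of the target entry.
        self.entries[path] = self.entries[target_path]

    def read(self, path):
        # Reading is identical for original entries and aliases.
        return self.blobs[self.entries[path]]

    def shared_content(self):
        """Groups of paths sharing one blob (what a checker would notice)."""
        groups = {}
        for path, blob_id in self.entries.items():
            groups.setdefault(blob_id, []).append(path)
        return [sorted(paths) for paths in groups.values() if len(paths) > 1]


archive = ToyArchive()
archive.add_content("example.org/video.mp4", b"video-bytes")
# A warc revisit record: same content, different URL.
archive.add_alias("example.org/video.mp4?session=abc", "example.org/video.mp4")
```

In this model the content is stored once, and `shared_content()` shows why a tool like zim-check would need to learn that such sharing is intentional.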
Fuzzy Matching
Fuzzy matching is a way to transform a (potentially not fixed) url into a fixed, known one.
There are two parts to fuzzy matching:
On the specification side, we need to define how we store the fuzzy matching rules used at reading time.
We also need to define who applies them (we need access to the query string: is it libkiwix doing several requests to libzim? Or libzim doing the transformation, in which case we need to pass it the query string?)
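As an illustration of what applying fuzzy rules at reading time could look like, here is a minimal sketch that assumes rules are stored as (regular expression, replacement) pairs. The rule shown, the helper names `apply_fuzzy_rules` and `resolve`, and the entry paths are all hypothetical; the actual storage format and the component that applies the rules are exactly what remains to be specified.

```python
import re

# Hypothetical rule set: each rule is (pattern, replacement).
# This one drops a trailing cache-busting "_=<digits>" query parameter.
FUZZY_RULES = [
    (re.compile(r"^(.*)[?&]_=\d+$"), r"\1"),
]


def apply_fuzzy_rules(path, rules):
    """Return the path transformed by the first matching rule, else unchanged."""
    for pattern, replacement in rules:
        new_path, count = pattern.subn(replacement, path)
        if count:
            return new_path
    return path


def resolve(path, entries, rules):
    """Try an exact lookup first, then a lookup on the fuzzy-reduced path."""
    if path in entries:
        return path
    candidate = apply_fuzzy_rules(path, rules)
    return candidate if candidate in entries else None


entries = {"example.org/app.js", "example.org/list?page=2"}
found = resolve("example.org/app.js?_=1699999999", entries, FUZZY_RULES)
missing = resolve("example.org/unknown", entries, FUZZY_RULES)
```

Note that `resolve` needs the full requested path including the query string, which is why the question of who receives the query string (libkiwix or libzim) matters.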
Implementation
[ ] libzim: Support storing and retrieving fuzzy rules (including parsing them)
[ ] libzim/libkiwix: Evolve the "routing" part (apply fuzzy rules, search for potential entries, ...). As the routing becomes more complex (than simply searching for an entry from the given url), it may be time to implement: Expose (kind of) InternalServer in the public API ? libkiwix#740
Not needed. The specific name scheme (and especially the fact that urls are url encoded) allows everything to be resolved from libzim.
Warc2zim
Once libzim/libkiwix is providing the needed feature, we need to adapt warc2zim.
Common url schema
We need to define where (using which url) we store our entries.
I suggest:
absolute/path (content from the origin host)
host.tld/absolute/path (content from any other host)
This way, "origin host" urls are the same as in a "non zimit" zim file.
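A minimal sketch of this mapping, assuming the origin ("main") host is known at creation time. The function name `entry_path` and the example hosts are hypothetical, and url encoding of the resulting path is deliberately left out.

```python
from urllib.parse import urlsplit


def entry_path(url, main_host):
    """Map an absolute URL to an entry path under the suggested schema."""
    parts = urlsplit(url)
    path = parts.path.lstrip("/")
    if parts.query:
        path += "?" + parts.query
    if parts.netloc == main_host:
        return path                    # origin host: host is elided
    return f"{parts.netloc}/{path}"    # other hosts: host is kept as a prefix


main = entry_path("https://example.org/absolute/path", "example.org")
other = entry_path("https://cdn.example.net/lib/app.js", "example.org")
```

With the eliding made optional, the same function without the `main_host` branch would give the host-prefixed form for every entry.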
We also remove the A/H sub-directory, which is a relic of namespaces.
Implementation
warc2zim : url rewriting. We need to parse the content (html/css) and rewrite the urls. Rewritten urls must be relative paths and conform to the common url schema.
We can reuse pywb with monkey patching as we do in the POC, or re-implement our own.
This part is relatively simple. We don't do any fuzzy matching or anything else. We simply rewrite urls to the common schema. It may be simpler to start from scratch than to integrate a project not designed to be integrated.
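The static rewriting step could be sketched as below: resolve each link against the page url, map it to its entry path, then express it relative to the entry of the current page. This is an illustration under simplifying assumptions (query strings and url encoding are ignored), and `to_entry_path` and `rewrite_link` are hypothetical helpers, not pywb or warc2zim functions.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def to_entry_path(url, main_host):
    """Entry path under the common schema (path only, query string ignored)."""
    parts = urlsplit(url)
    path = parts.path.lstrip("/")
    return path if parts.netloc == main_host else f"{parts.netloc}/{path}"


def rewrite_link(href, page_url, main_host):
    """Rewrite a link found in page_url into a path relative to that page."""
    target = to_entry_path(urljoin(page_url, href), main_host)
    page_dir = posixpath.dirname(to_entry_path(page_url, main_host))
    return posixpath.relpath(target, start=page_dir or ".")


page = "https://example.org/docs/guide/index.html"
same_site = rewrite_link("/docs/img/logo.png", page, "example.org")
other_site = rewrite_link("https://cdn.example.net/lib/app.js", page, "example.org")
```

Because the result is always relative, the rewritten content does not depend on where the reader mounts the zim.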
warc2zim : Integrate dynamic url rewriting. Definitely too complex to re-implement. Wombat.js has been moved to a separate repository and is advertised as embeddable. Let's do it! So we must modify the html (and js content ?) to insert some js that installs wombat in each page.
. [ ] wombat : While wombat is supposed to be embeddable (and it is), it seems there is no way to specify our own url rewrite function. We need to make a PR on wombat to make this configurable.
. [x] wombat/warc2zim: Implement the rewrite function (js) and use it in wombat
. [x] wombat/warc2zim: In the POC, we configure wombat with a few fixed, absolute values. We need to make these relative. This may be a simple configuration change, or it may need a PR on wombat.
Other projects:
java-libkiwix: update to new libzim/libkiwix api
kiwix-desktop: use new features (routing handling). (Or do it ourselves)
kiwix-android: use new features (routing handling). (Or do it ourselves)
zim-dump : We have to properly dump alias (hard-link ?) and fuzzy-rules (nginx rewrite rules ?). See How should zimdump deal with aliases openzim/zim-tools#395
Open questions :
[This may need to adapt the zim/kiwix specification, warc2zim and libzim/libkiwix routing)
But the android video player uses direct access to read the video, so we need to "regroup" records into one entry.
TODO : Check whether videos are really composed of several records. If yes, how do we detect that records belong to the same video? How do we rebuild a single entry? How will fuzzy rule matching work with regrouped entries?