-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Binary format for the static prediction tables #7
Comments
I got some data with the latest dictionary builder. We'll need an efficient format to actually ship dictionaries. |
A few things to recall:
In turn, this shipping as diffs suggests that we want to ship number of occurrences (easy to add to without precision loss), rather than actual probabilities (floating points, harder to manipulate). |
Recapping some conversations from IRC here. Firstly, I feel we should settle on some consistent terminology. "Dictionary" is a very overloaded term and suggests words or character-sequence tables of some sort. I would suggest using "Prediction Tables" to talk about this. They are mappings from various contexts (e.g. Tree-path-suffix, or move-to-front cache of strings) to a table which acts as a predictor that provides a probability distribution for elements showing up at some location in the data stream. That said, the key thing that's relevant here is that we have decided that prediction tables will not be shipped in-line with the content, and that we will allow compressed content to reference a prediction table (e.g. with a URL + verification hash). This effectively offloads the problem entirely. Initially, we can produce a "standard" prediction table derived from a "general corpus" of javascript. This is the default table which tools can use. This table can be made available at a standard "well-known" location (or shipped with the browser, as an implementation detail). In the general case, an individual binast file may reference a custom prediction table. However, we know that we can expect that these tables will change very frequently, and will be effectively be cached locally for every request that is made to content that references it (except for the first). From the browser's side, the logic for decoding an incoming binast file A referencing a table T would be:
In this scenario, the "default table" is simply just another table that may be referenced. Whether it's shipped with the browser or not is an implementation detail that doens't matter from a performance perspective. |
Pinging @dominiccooney for discussion alert. |
This changeset extendes Dictionary<> and KindedStringsPerFile<> to support the generic Serde (de)serialization mechanism. While we have not decided a format for storing dictionaries (that's binast/container-format-spec#7 ), this will let us experiment using a simple, temporary format.
This changeset extendes Dictionary<> and KindedStringsPerFile<> to support the generic Serde (de)serialization mechanism. While we have not decided a format for storing dictionaries (that's binast/container-format-spec#7 ), this will let us experiment using a simple, temporary format.
This changeset extendes Dictionary<> and KindedStringsPerFile<> to support the generic Serde (de)serialization mechanism. While we have not decided a format for storing dictionaries (that's binast/container-format-spec#7 ), this will let us experiment using a simple, temporary format.
This changeset extendes Dictionary<> and KindedStringsPerFile<> to support the generic Serde (de)serialization mechanism. While we have not decided a format for storing dictionaries (that's binast/container-format-spec#7 ), this will let us experiment using a simple, temporary format.
This changeset extendes Dictionary<> and KindedStringsPerFile<> to support the generic Serde (de)serialization mechanism. While we have not decided a format for storing dictionaries (that's binast/container-format-spec#7 ), this will let us experiment using a simple, temporary format.
Note that it's a bit more complicated than that, as:
We should move this conversation in the relevant bug, though: #4 . |
Since we want to ship dictionaries, we need to specify their format, too.
I'm pretty sure that we can use efficient entropy coding also for the dictionaries, fwiw.
The text was updated successfully, but these errors were encountered: