Binary format for the static prediction tables #7

Yoric · 2018-10-04T15:00:20Z

Since we want to ship dictionaries, we need to specify their format, too.

I'm pretty sure that we can use efficient entropy coding also for the dictionaries, fwiw.

Yoric · 2018-10-09T16:35:44Z

I got some data with the latest dictionary builder. We'll need an efficient format to actually ship dictionaries.

Yoric · 2018-10-09T16:38:38Z

A few things to recall:

whatever dictionary we ship today, we still need to ship in 10 years (the format may change, but not the contents);
we will need to ship new dictionaries whenever the JavaScript AST changes, which suggests we may wish to ship them somehow as diffs;
after a few versions of JavaScript, we'll end up with several dictionaries in memory at the same time;
developers may wish to ship a new dictionary because for some reason their code is not compressed correctly by our entropy coder, which also suggest we may wish to ship dictionaries as diffs.

In turn, this shipping as diffs suggests that we want to ship number of occurrences (easy to add to without precision loss), rather than actual probabilities (floating points, harder to manipulate).

kannanvijayan-zz · 2018-10-09T16:47:50Z

Recapping some conversations from IRC here.

Firstly, I feel we should settle on some consistent terminology. "Dictionary" is a very overloaded term and suggests words or character-sequence tables of some sort. I would suggest using "Prediction Tables" to talk about this. They are mappings from various contexts (e.g. Tree-path-suffix, or move-to-front cache of strings) to a table which acts as a predictor that provides a probability distribution for elements showing up at some location in the data stream.

That said, the key thing that's relevant here is that we have decided that prediction tables will not be shipped in-line with the content, and that we will allow compressed content to reference a prediction table (e.g. with a URL + verification hash). This effectively offloads the problem entirely.

Initially, we can produce a "standard" prediction table derived from a "general corpus" of javascript. This is the default table which tools can use. This table can be made available at a standard "well-known" location (or shipped with the browser, as an implementation detail).

In the general case, an individual binast file may reference a custom prediction table. However, we know that we can expect that these tables will change very frequently, and will be effectively be cached locally for every request that is made to content that references it (except for the first).

From the browser's side, the logic for decoding an incoming binast file A referencing a table T would be:

Receive binast file A.
Header of binast file A identifies probability table T by URL+content-hash
Check local cache for T (using content-hash as key) - highly likely to hit. Retrieve T and cache if necessary.
Proceed to decoding binast file A using probability table T.

In this scenario, the "default table" is simply just another table that may be referenced. Whether it's shipped with the browser or not is an implementation detail that doens't matter from a performance perspective.

kannanvijayan-zz · 2018-10-09T17:20:26Z

Pinging @dominiccooney for discussion alert.

This changeset extendes Dictionary<> and KindedStringsPerFile<> to support the generic Serde (de)serialization mechanism. While we have not decided a format for storing dictionaries (that's binast/container-format-spec#7 ), this will let us experiment using a simple, temporary format.

Yoric · 2018-11-08T14:29:19Z

From the browser's side, the logic for decoding an incoming binast file A referencing a table T would be:
Receive binast file A.
Header of binast file A identifies probability table T by URL+content-hash
Check local cache for T (using content-hash as key) - highly likely to hit. Retrieve T and cache if necessary.
Proceed to decoding binast file A using probability table T.
In this scenario, the "default table" is simply just another table that may be referenced. Whether it's shipped with the browser or not is an implementation detail that doens't matter from a performance perspective.

Note that it's a bit more complicated than that, as:

A and T need to be received by the browser (in Firefox, that's by Necko);
A and T need to be decompressed by the JS VM (in Firefox, that's SpiderMonkey).

We should move this conversation in the relevant bug, though: #4 .
Consequently, we can already imagine that SpiderMonkey will need to expose an API for Necko to inspect A to determine association with T.

This was referenced Oct 9, 2018

Define the format, metadata for shared probability tables, strings, etc. binast/binjs-ref#164

Closed

Towards a dictionary builder binast/binjs-ref#185

Merged

Yoric changed the title ~~Binary format for the dictionary itself~~ Binary format for the static prediction tables Oct 10, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Binary format for the static prediction tables #7

Binary format for the static prediction tables #7

Yoric commented Oct 4, 2018

Yoric commented Oct 9, 2018

Yoric commented Oct 9, 2018 •

edited

Loading

kannanvijayan-zz commented Oct 9, 2018

kannanvijayan-zz commented Oct 9, 2018

Yoric commented Nov 8, 2018 •

edited

Loading

Binary format for the static prediction tables #7

Binary format for the static prediction tables #7

Comments

Yoric commented Oct 4, 2018

Yoric commented Oct 9, 2018

Yoric commented Oct 9, 2018 • edited Loading

kannanvijayan-zz commented Oct 9, 2018

kannanvijayan-zz commented Oct 9, 2018

Yoric commented Nov 8, 2018 • edited Loading

Yoric commented Oct 9, 2018 •

edited

Loading

Yoric commented Nov 8, 2018 •

edited

Loading