I have an idea for a simple model/encoding for strings that I'd like to test at some point.
It starts from the observation that any string we encounter more than once may be represented in one of three ways:

- as an index from the start of the file;
- as an index from the latest string;
- if the string is part of the AOT dictionary, as an index in that dictionary.
As discussed, some strings tend to have many instances in a given window, while others don't. I suspect that, by picking the best of these three representations for each string, we'll be able to reduce the encoded size.
We represent this as the following alphabet:
```rust
enum Symbol {
    /// Well-known string, stored in the dictionary we shipped with the encoder/decoder.
    BuiltInDictionary(usize),
    /// A string already referenced in this file, as indexed from the start.
    /// 0 is the first string encountered in the file, 1 the second, ...
    FromStart(usize),
    /// A string already referenced in this file, as indexed from the current position.
    /// 0 is the latest string encountered in the file, 1 the previous, ...
    FromCurrent(usize),
    /// A new string, never before encountered.
    /// Must be followed by a literal string.
    New,
}
```
Now, whenever we encounter a string, we add it to the following in-memory tables. The next time we encounter the same string, these two tables let us find how to represent it using either `Symbol::FromStart` or `Symbol::FromCurrent`:
```rust
pub struct State {
    /// First index of a given string. When we use this, we try and keep numbers small.
    first: HashMap<Rc<String>, usize>,
    /// Latest index of a given string.
    latest: HashMap<Rc<String>, usize>,
    // ...
}
```
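As a rough sketch of how these two tables could yield candidate symbols, the following assumes that every string occurrence (new or repeated) advances a position counter, and that `FromCurrent` is measured back from the most recent occurrence. The `seen` field and the `candidates` and `record` methods are hypothetical names introduced for illustration; the built-in dictionary is left out to keep the example short.

```rust
use std::collections::HashMap;
use std::rc::Rc;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Symbol {
    FromStart(usize),
    FromCurrent(usize),
    New,
}

#[derive(Default)]
pub struct State {
    /// Position of the first occurrence of each string.
    first: HashMap<Rc<String>, usize>,
    /// Position of the latest occurrence of each string.
    latest: HashMap<Rc<String>, usize>,
    /// Number of occurrences recorded so far (assumption: every occurrence counts).
    seen: usize,
}

impl State {
    /// Symbols that could represent `s`, computed before `s` itself is recorded.
    pub fn candidates(&self, s: &str) -> Vec<Symbol> {
        let key = Rc::new(s.to_string());
        match (self.first.get(&key), self.latest.get(&key)) {
            (Some(&first), Some(&latest)) => vec![
                Symbol::FromStart(first),
                // Distance back from the most recent occurrence (0 = latest).
                Symbol::FromCurrent(self.seen - 1 - latest),
            ],
            _ => vec![Symbol::New],
        }
    }

    /// Record one occurrence of `s` at the current position.
    pub fn record(&mut self, s: &str) {
        let key = Rc::new(s.to_string());
        self.first.entry(Rc::clone(&key)).or_insert(self.seen);
        self.latest.insert(key, self.seen);
        self.seen += 1;
    }
}
```

Under these assumptions, after recording `"a"`, `"b"`, `"c"`, the string `"a"` could be written as either `FromStart(0)` or `FromCurrent(2)`, while a string not seen before only has `New` available.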
We then add statistics, to find out which representation of a string is best:
```rust
pub struct State {
    // ...
    /// A mapping from `index` to number of times we have used
    /// `Symbol::BuiltInDictionary(index)`.
    frequency_built_in: VecMap<usize>,
    /// A mapping from `index` to number of times we have used
    /// `Symbol::FromStart(index)`.
    frequency_from_start: VecMap<usize>,
    /// A mapping from `index` to number of times we have used
    /// `Symbol::FromCurrent(index)`.
    frequency_from_latest: VecMap<usize>,
}
```
With these two pieces of information (`first`/`latest` and `frequency_*`), we may find, for each string, the most common symbol we may use to represent it.
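One possible way to pick that symbol is sketched below. It assumes the candidate symbols for a string have already been computed, uses `std::collections::HashMap` in place of `VecMap` to stay dependency-free, and breaks ties in favour of the first candidate. The `Frequencies` type and its `count`, `record_use` and `most_common` methods are hypothetical names, not part of the proposal.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Symbol {
    BuiltInDictionary(usize),
    FromStart(usize),
    FromCurrent(usize),
}

/// Usage counters, one table per kind of symbol (stand-in for `frequency_*`).
#[derive(Default)]
pub struct Frequencies {
    built_in: HashMap<usize, usize>,
    from_start: HashMap<usize, usize>,
    from_latest: HashMap<usize, usize>,
}

impl Frequencies {
    /// Number of times `symbol` has been used so far.
    pub fn count(&self, symbol: Symbol) -> usize {
        let (table, index) = match symbol {
            Symbol::BuiltInDictionary(i) => (&self.built_in, i),
            Symbol::FromStart(i) => (&self.from_start, i),
            Symbol::FromCurrent(i) => (&self.from_latest, i),
        };
        table.get(&index).copied().unwrap_or(0)
    }

    /// Note that `symbol` has been emitted once more.
    pub fn record_use(&mut self, symbol: Symbol) {
        let (table, index) = match symbol {
            Symbol::BuiltInDictionary(i) => (&mut self.built_in, i),
            Symbol::FromStart(i) => (&mut self.from_start, i),
            Symbol::FromCurrent(i) => (&mut self.from_latest, i),
        };
        *table.entry(index).or_insert(0) += 1;
    }

    /// The candidate used most often so far; ties go to the earliest candidate.
    pub fn most_common(&self, candidates: &[Symbol]) -> Option<Symbol> {
        let mut best: Option<(Symbol, usize)> = None;
        for &candidate in candidates {
            let n = self.count(candidate);
            if best.map_or(true, |(_, best_n)| n > best_n) {
                best = Some((candidate, n));
            }
        }
        best.map(|(symbol, _)| symbol)
    }
}
```

The decoder would need to maintain the same counters in the same order for this choice to stay in sync with the encoder; that part is not covered by the sketch.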