A model for strings #1

Yoric · 2018-08-31T12:33:27Z

I've had an idea for a simple model/encoding for strings, which I'd like to test at some point.

It starts from the remark that any string we encounter more than once may be represented in one of three ways:

as an index from the start;
as an index from the latest string;
if the string is part of the AOT dictionary, as an index in that dictionary.

As discussed, some strings tend to have many instances in a given window, while others don't. I suspect that, by picking the best of these three representations for each string, we'll be able to reduce the size of the encoded file.

We represent this as the following alphabet:

enum Symbol {
  /// Well-known string, stored in the dictionary we shipped with the encoder/decoder.
  BuiltInDictionary(usize),

  /// A string already referenced in this file, as indexed from the start.
  /// 0 is the first string encountered in the file, 1 the second, ...
  FromStart(usize),

  /// A string already referenced in this file, as indexed from the current position.
  /// 0 is the latest string encountered in the file, 1 the previous, ...
  FromCurrent(usize),

  /// A new string, never before encountered.
  /// Must be followed by a literal string.
  New
}

Now, whenever we encounter a string, we add it to the following in-memory tables. Both tables will let us find how to represent a string, when we next encounter it, using either Symbol::FromStart or Symbol::FromCurrent:

use std::collections::HashMap;
use std::rc::Rc;

pub struct State {
  /// First index of a given string. When we use this, we try and keep numbers small.
  first: HashMap<Rc<String>, usize>,

  /// Latest index of a given string.
  latest: HashMap<Rc<String>, usize>,

  // ...
}
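To make the bookkeeping concrete, here is a minimal sketch of how the two tables could yield candidate symbols. It rests on assumptions that are not stated above: every string occurrence gets a running index (the hypothetical `position` field), `first` and `latest` store such occurrence indices, and the `visit` method is an illustrative name. The built-in dictionary is left out.

```rust
use std::collections::HashMap;
use std::rc::Rc;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Symbol {
    FromStart(usize),
    FromCurrent(usize),
    New,
}

#[derive(Default)]
pub struct State {
    /// Index of the first occurrence of each string.
    first: HashMap<Rc<String>, usize>,
    /// Index of the latest occurrence of each string.
    latest: HashMap<Rc<String>, usize>,
    /// Number of string occurrences seen so far (hypothetical field).
    position: usize,
}

impl State {
    /// Return the symbols that could represent `string` at this point,
    /// then record this occurrence in both tables.
    pub fn visit(&mut self, string: &str) -> Vec<Symbol> {
        let key = Rc::new(string.to_string());
        let candidates = match (self.first.get(&key), self.latest.get(&key)) {
            (Some(&first), Some(&latest)) => vec![
                Symbol::FromStart(first),
                // 0 is the occurrence just before the current one.
                Symbol::FromCurrent(self.position - 1 - latest),
            ],
            // Never seen before: the literal string has to follow.
            _ => vec![Symbol::New],
        };
        // Keep the first index stable, always refresh the latest one.
        self.first.entry(key.clone()).or_insert(self.position);
        self.latest.insert(key, self.position);
        self.position += 1;
        candidates
    }
}
```

Under these assumptions, visiting "a", "b", "a" in order yields `New`, `New`, and then the two candidates `FromStart(0)` and `FromCurrent(1)` for the second "a".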

We then add statistics, to find out which is the best representation of a string:

pub struct State {
  // ...

  /// A mapping from `index` to number of times we have used
  /// `Symbol::BuiltInDictionary(index)`.
  frequency_built_in: VecMap<usize>,

  /// A mapping from `index` to number of times we have used
  /// `Symbol::FromStart(index)`.
  frequency_from_start: VecMap<usize>,

  /// A mapping from `index` to number of times we have used
  /// `Symbol::FromCurrent(index)`.
  frequency_from_latest: VecMap<usize>,
}

With these two pieces of information (first/latest and frequency_*), we may find, for each string, the most common symbol we may use to represent it.
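One possible reading of "most common symbol" is sketched below: among the candidate symbols for a string, pick the one that has been emitted most often so far, then bump its counter. This is only an illustration under assumptions. The `Frequencies` type and its `count` and `pick` methods are hypothetical names, a plain `HashMap` stands in for `VecMap`, and ties are broken in favour of the earliest candidate.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Symbol {
    BuiltInDictionary(usize),
    FromStart(usize),
    FromCurrent(usize),
}

/// Usage counters, one table per kind of symbol (hypothetical type).
#[derive(Default)]
pub struct Frequencies {
    built_in: HashMap<usize, usize>,
    from_start: HashMap<usize, usize>,
    from_latest: HashMap<usize, usize>,
}

impl Frequencies {
    /// Number of times `symbol` has been picked so far.
    pub fn count(&self, symbol: Symbol) -> usize {
        let (table, index) = match symbol {
            Symbol::BuiltInDictionary(i) => (&self.built_in, i),
            Symbol::FromStart(i) => (&self.from_start, i),
            Symbol::FromCurrent(i) => (&self.from_latest, i),
        };
        table.get(&index).copied().unwrap_or(0)
    }

    /// Pick the candidate used most often so far and record the use.
    /// Ties go to the earliest candidate; an empty slice yields `None`.
    pub fn pick(&mut self, candidates: &[Symbol]) -> Option<Symbol> {
        let mut best: Option<(Symbol, usize)> = None;
        for &candidate in candidates {
            let count = self.count(candidate);
            if best.map_or(true, |(_, best_count)| count > best_count) {
                best = Some((candidate, count));
            }
        }
        let (symbol, _) = best?;
        let (table, index) = match symbol {
            Symbol::BuiltInDictionary(i) => (&mut self.built_in, i),
            Symbol::FromStart(i) => (&mut self.from_start, i),
            Symbol::FromCurrent(i) => (&mut self.from_latest, i),
        };
        *table.entry(index).or_insert(0) += 1;
        Some(symbol)
    }
}
```

The intuition behind this greedy choice is that an entropy coder spends fewer bits on symbols it sees often, so reusing an already-frequent symbol should cost less than introducing a rare one. Whether this actually reduces the size is exactly what the experiment would need to measure.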

Yoric added the Experiment: Model label Aug 31, 2018
