Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A model for strings #1

Open
Yoric opened this issue Aug 31, 2018 · 0 comments
Open

A model for strings #1

Yoric opened this issue Aug 31, 2018 · 0 comments

Comments

@Yoric
Copy link
Contributor

Yoric commented Aug 31, 2018

I've had an idea of a simple model/encoding for strings, I'd like to test it at some point.

It starts from the remark that any string we encounter more than once may be represented either:

  • as an index from start;
  • as an index from latest string;
  • if the string is part of the AOT dictionary, as an index in that dictionary.

As discussed, some strings tend to have many instances in a given window, while others don't. I suspect that, by picking the best of these three representation, we'll be able to reduce the size.

We represent this as the following alphabet:

enum Symbol {
  /// Well-known string, stored in the dictionary we shipped with the encoder/decoder.
  BuiltInDictionary(usize),

  /// A string already referenced in this file, as indexed from the start.
  /// 0 is the first string encountered in the file, 1 the second, ...
  FromStart(usize),

  /// A string already referenced in this file, as indexed from the current position.
  /// 0 is the latest string encountered in the file, 1 the previous, ...
  FromCurrent(usize),

  /// A new string, never before encountered.
  /// Must be followed by a literal string.
  New
}

Now, whenever we encounter a string, we add it to the following in-memory tables. Both tables will let us find how to represent a string, when we next encounter it, using either Symbol::FromStart and Symbol::FromCurrent:

pub struct State {
  /// First index of a given string. When we use this, we try and keep numbers small.
  first: HashMap<Rc<String>, usize>,

  /// Latest index of a given string.
  latest: HashMap<Rc<String>, usize>,

  // ...
};

We then add statistics, to find out which is best representation of a string

pub struct State {
  // ...

  /// A mapping from `index` to number of times we have used
  /// `Symbol::BuiltinDictionary(index)`.
  frequency_built_in: VecMap<usize>,

  /// A mapping from `index` to number of times we have used
  /// `Symbol::FromStart(index)`.
  frequency_from_start: VecMap<usize>,

  /// A mapping from `index` to number of times we have used
  /// `Symbol::FromCurrent(index)`.
  frequency_from_latest: VecMap<usize>,
}

With these two pieces of information (first/latest and frequency_*), we may find, for each string, the most common symbol we may use to represent it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant