Tokenize on byte offsets instead of character indexes #7
Description
I was initially indexing on characters, assuming everything was ASCII (one byte per char), but this is an obvious problem in Rust, where the `&str` type guarantees UTF-8 encoding and some characters are longer than one byte. That throws a spanner in the works when I do something like `source[idx]`, which in this case indexes on the byte offset of the character, and it would panic when that offset is not a character boundary.
By indexing on byte offsets, I can safely rely on ASCII as usual within the UTF-8 encoded strings to look up Markdown symbols, while still handling multi-byte characters like emojis as normal text.
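As a rough illustration of why byte offsets are safe here, the sketch below scans a string for an ASCII marker and slices on the resulting offsets. The helper names (`marker_offsets`, `split_on_marker`, `try_slice`) are hypothetical and are not taken from this PR's tokenizer. It relies only on the fact that every byte of a multi-byte UTF-8 character is at least `0x80`, so an ASCII byte can never appear in the middle of one.

```rust
/// Returns the byte offset of every occurrence of an ASCII `marker` in `source`.
/// Hypothetical helper for illustration only.
fn marker_offsets(source: &str, marker: u8) -> Vec<usize> {
    // Only ASCII markers are guaranteed to sit on a character boundary.
    debug_assert!(marker.is_ascii());
    source
        .bytes()
        .enumerate()
        .filter(|&(_, b)| b == marker)
        .map(|(offset, _)| offset)
        .collect()
}

/// Splits `source` into the text between markers, slicing on byte offsets.
/// The slices never panic because each marker offset is a character boundary.
fn split_on_marker(source: &str, marker: u8) -> Vec<&str> {
    let mut parts = Vec::new();
    let mut start = 0;
    for offset in marker_offsets(source, marker) {
        parts.push(&source[start..offset]);
        start = offset + 1; // an ASCII marker is exactly one byte long
    }
    parts.push(&source[start..]);
    parts
}

/// Non-panicking slice: returns `None` when either end of the range
/// falls inside a multi-byte character or is out of bounds.
fn try_slice(source: &str, start: usize, end: usize) -> Option<&str> {
    source.get(start..end)
}
```

With the input `"*hi 🎉* ok"`, the second `*` has character index 5 but byte offset 8, because `🎉` takes four bytes. Slicing with `&source[1..2]` on a string such as `"a🎉"` would panic, whereas `try_slice("a🎉", 1, 2)` simply returns `None`.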
A bit on graphemes and grapheme clusters
A grapheme cluster is a sequence of one or more Unicode code points that should be treated as a single unit by various processes. This is especially important for emojis, where several emojis in sequence can combine into a single, different grapheme cluster.
Let's take this example into account:
The `family` variable is actually a combination of several emojis. When they follow one another in this order, they should instead render as the single "👨👩👦" emoji, with a width of 6. The emojis themselves are each represented as multi-byte characters.
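The original snippet defining `family` is not preserved here, so the sketch below assumes it is the man, woman, boy sequence joined with zero-width joiners (U+200D), which is the usual encoding of that family emoji. `FAMILY` and `code_points` are hypothetical names used only for this illustration. It lists each code point with its byte offset and UTF-8 length using only the standard library.

```rust
/// Assumed composition of the family emoji: MAN, ZWJ, WOMAN, ZWJ, BOY.
/// Written with escapes so the joiners are visible in the source.
const FAMILY: &str = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F466}";

/// Returns `(byte_offset, code_point, utf8_len)` for every code point in `s`.
fn code_points(s: &str) -> Vec<(usize, char, usize)> {
    s.char_indices()
        .map(|(offset, c)| (offset, c, c.len_utf8()))
        .collect()
}
```

Under that assumption, `FAMILY.chars().count()` is 5 while `FAMILY.len()` is 18 bytes: each of the three emojis takes four bytes and each joiner takes three. The standard library stops at code points. Treating the whole sequence as one grapheme cluster needs a segmentation library such as the `unicode-segmentation` crate.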
Reading Material
Joel Spolsky has a really good article about this here:
Rust lang forum thread with some tips on how to handle this.