Tokenize on byte offsets instead of character indexes #7

brunojppb · 2024-05-20T21:11:56Z

Description

I was initially indexing on characters, assuming everything was ASCII (one byte per char). That is a problem in Rust, where the &str type guarantees UTF-8 encoding and some characters are longer than one byte. It throws a spanner in the works when I do something like source[idx], because string indexing works on byte offsets, not character positions. If the offset lands in the middle of a multi-byte character, it panics with something like:

thread 'rustc' panicked at 'byte index 4 is not a char boundary; it is inside '🤡' (bytes 0..4) of...'

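By indexing on byte offsets, I can safely rely on ASCII as usual within the UTF-8 encoded strings to look up Markdown symbols, but still handle multi-byte characters like emojis as normal text. This works because every byte of a multi-byte UTF-8 sequence has its high bit set, so an ASCII byte such as * or # can never appear inside another character.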
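As a rough illustration of the idea (not the actual tokenizer from this PR), the sketch below scans the raw bytes for an ASCII marker and slices the string at the offsets it finds. The function name find_marker_offsets and the sample input are made up for this example.

```rust
/// Returns the byte offset of every occurrence of an ASCII `marker` in `source`.
/// Hypothetical helper for illustration; it is not part of this PR.
fn find_marker_offsets(source: &str, marker: u8) -> Vec<usize> {
    // Only ASCII markers are safe to match byte-by-byte.
    assert!(marker.is_ascii());
    source
        .bytes()
        .enumerate()
        .filter(|&(_, byte)| byte == marker)
        .map(|(offset, _)| offset)
        .collect()
}

fn main() {
    let source = "🤡 *bold* text";

    // The emoji occupies bytes 0..4, so offset 2 is not a char boundary.
    assert!(!source.is_char_boundary(2));
    // `get` returns None instead of panicking on a bad boundary.
    assert!(source.get(0..2).is_none());

    // An ASCII byte always starts and ends on a char boundary,
    // so slicing right after one marker and up to the next is safe.
    let offsets = find_marker_offsets(source, b'*');
    let inner = &source[offsets[0] + 1..offsets[1]];
    println!("{offsets:?} -> {inner}");
}
```

Using str::get instead of direct slicing is a reasonable fallback anywhere an offset might not come from an ASCII match.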

A bit on graphemes and Grapheme clusters

A grapheme cluster is a sequence of one or more Unicode code points that should be treated as a single unit by various processes. This is especially important for emojis, where several emojis can combine into a single grapheme cluster.

Let's take this example into account:

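// Requires the unicode-width crate.
use unicode_width::UnicodeWidthStr;

fn main() {
    // A single grapheme cluster built from three emojis.
    let family = "👨‍👩‍👦";
    // Print the CJK and non-CJK display widths (may vary by crate version).
    println!("{} {}", family.width_cjk(), UnicodeWidthStr::width(family));
    println!("{family}");
    // Print a row of X characters as wide as the reported CJK width.
    println!("{:X<width$}", "", width = family.width_cjk());
}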

The family variable is actually a combination of the following emojis:

👨
👩
👦

When they appear in this order, joined by zero-width joiners (U+200D), they should render as the single "👨‍👩‍👦" emoji, which the example above reports as having a width of 6. Each of the individual emojis is itself a multi-byte character (4 bytes in UTF-8).
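To make the composition concrete, here is a small standard-library-only sketch that lists the code points inside the family emoji along with their UTF-8 byte lengths. The helper name describe is made up for this example, and the zero-width joiners are written as \u{200D} escapes so they are visible in the source.

```rust
/// Lists each code point in `s` together with its UTF-8 length in bytes.
/// Hypothetical helper for illustration only.
fn describe(s: &str) -> Vec<(char, usize)> {
    s.chars().map(|c| (c, c.len_utf8())).collect()
}

fn main() {
    // Man, ZWJ, woman, ZWJ, boy: one grapheme cluster, five code points.
    let family = "👨\u{200D}👩\u{200D}👦";

    for (c, bytes) in describe(family) {
        println!("U+{:04X} takes {} bytes", c as u32, bytes);
    }

    // Three emojis at 4 bytes each plus two joiners at 3 bytes each.
    println!(
        "code points: {}, bytes: {}",
        family.chars().count(),
        family.len()
    );
}
```

The standard library has no grapheme cluster iterator, so counting user-perceived characters needs an external crate such as unicode-segmentation.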

Reading Material

Joel Spolsky has a really good article about this:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

There is also a Rust language forum thread with some tips on how to handle this:

Tracking position in unicode-enabled lexer, best practice?

brunojppb added 2 commits May 20, 2024 23:04

Tokenize on byte offsets instead of character indexes

ef87d08

Remove debug

7aa7bc1

brunojppb force-pushed the multi-byte-chars branch from be86b3e to 7aa7bc1 Compare May 20, 2024 21:21

brunojppb merged commit 6b97716 into main May 20, 2024
3 checks passed

brunojppb deleted the multi-byte-chars branch June 1, 2024 12:13
