Tokenize on byte offsets instead of character indexes #7

Merged
merged 2 commits into from
May 20, 2024

Conversation

brunojppb
Owner

@brunojppb brunojppb commented May 20, 2024

Description

I was initially indexing on characters, assuming everything was ASCII (one byte per char). That is an obvious problem in Rust, where the &str type guarantees UTF-8 encoding and some characters are longer than one byte. It throws a spanner in the works when I do something like source[idx], which indexes on byte offsets rather than on characters, so an index that lands in the middle of a multi-byte character panics with something like:

thread 'rustc' panicked at 'byte index 4 is not a char boundary; it is inside '🤡' (bytes 0..4) of...'

By indexing on byte offsets, I can safely rely on ASCII as usual within the UTF-8 encoded strings to look up Markdown symbols, while still handling multi-byte characters like emojis as normal text.
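As a rough illustration of the idea above, here is a minimal sketch of a byte-offset scanner. The `Token` enum, the `tokenize` function and the set of symbols are hypothetical names made up for this example, not the types used in this PR. It relies on the fact that in UTF-8 every byte of a multi-byte character is 0x80 or above, so an ASCII byte such as `*` can never appear inside one, and slicing right before or after it always lands on a char boundary.

```rust
/// Hypothetical token type for illustration only; not the PR's actual types.
#[derive(Debug, PartialEq)]
enum Token<'a> {
    /// A single ASCII Markdown symbol.
    Symbol(char),
    /// A run of plain text, which may contain multi-byte characters.
    Text(&'a str),
}

/// Scans `source` byte by byte and splits it on a few ASCII Markdown symbols.
fn tokenize(source: &str) -> Vec<Token<'_>> {
    let bytes = source.as_bytes();
    let mut tokens = Vec::new();
    // Byte offset where the current run of plain text begins.
    let mut text_start = 0;

    for (idx, &byte) in bytes.iter().enumerate() {
        // Assumed symbol set; a real tokenizer would handle many more.
        if matches!(byte, b'#' | b'*' | b'_' | b'`') {
            if text_start < idx {
                // Safe to slice: `idx` points at an ASCII byte, so it is a char boundary.
                tokens.push(Token::Text(&source[text_start..idx]));
            }
            tokens.push(Token::Symbol(byte as char));
            text_start = idx + 1;
        }
    }

    // Flush any trailing text after the last symbol.
    if text_start < bytes.len() {
        tokens.push(Token::Text(&source[text_start..]));
    }
    tokens
}

fn main() {
    let tokens = tokenize("# hi 🤡 *bold*");
    println!("{tokens:?}");
}
```

The emoji never needs special handling here: its four bytes simply fall inside a `Text` slice whose ends are both ASCII positions.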

A bit on graphemes and grapheme clusters

A grapheme cluster is a sequence of one or more Unicode code points that should be treated as a single unit by various processes. This is especially important for emojis, where multiple emojis can combine into a single grapheme cluster.

Let's take this example into account:

use unicode_width::UnicodeWidthStr;

fn main() {
    let family = "👨‍👩‍👦";
    println!("{} {}", family.width_cjk(), UnicodeWidthStr::width(family));
    println!("{family}");
    println!("{:X<width$}", "", width = family.width_cjk());
}

The family variable is actually a combination of the following emojis, joined by zero-width joiners (U+200D):

  • 👨
  • 👩
  • 👦

When they follow each other in this order, they should render as the single "👨‍👩‍👦" emoji, which the example above reports with a width of 6. Each of the emojis by itself is represented as a multi-byte character.
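To make the difference between bytes, code points and the rendered emoji more concrete, here is a small standard-library-only sketch. The `FAMILY` constant is a name chosen for this example; it spells the same sequence with explicit escapes so the zero-width joiners are visible.

```rust
/// Man, zero-width joiner, woman, zero-width joiner, boy.
const FAMILY: &str = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F466}";

fn main() {
    // Each emoji takes 4 bytes and each joiner takes 3 bytes in UTF-8.
    println!("bytes: {}", FAMILY.len()); // prints "bytes: 18"

    // Three emojis plus two joiners.
    println!("code points: {}", FAMILY.chars().count()); // prints "code points: 5"

    // Byte offsets at which each code point starts.
    for (offset, ch) in FAMILY.char_indices() {
        println!("offset {offset:>2}: U+{:04X}", ch as u32);
    }

    // Offset 1 is inside the first emoji, so slicing there would panic.
    println!("boundary at 1? {}", FAMILY.is_char_boundary(1)); // prints "boundary at 1? false"
    println!("boundary at 4? {}", FAMILY.is_char_boundary(4)); // prints "boundary at 4? true"
}
```

Checking `is_char_boundary` before slicing is one way to avoid the panic shown earlier when an offset comes from somewhere that is not guaranteed to be aligned to a character.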

Reading Material

Joel Spolsky has a really good article about this here:

Rust lang forum thread with some tips on how to handle this.

@brunojppb brunojppb merged commit 6b97716 into main May 20, 2024
3 checks passed
@brunojppb brunojppb deleted the multi-byte-chars branch June 1, 2024 12:13