Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Tokenizer.prototype.tokenizeFrom string length after normalizing #1628

Open
wants to merge 2 commits into
base: horizon
Choose a base branch
from

Conversation

brandon-gong
Copy link

This pull request addresses #1627, in which I was getting strange bugs on a particular character.

Currently tokenizeFrom normalizes the given source string to Unicode Normalization Form C, then stores the original string's length in a separate variable this.len. However, calling .normalize() on a string can change its length, so its necessary this.len reflects the length of the newly normalized string to avoid lexing errors.

This issue turns out to be pretty prevalent, and I've found a slew of characters that cause the same error in Pyret right now. Below is a small sample of them that I found with a small script, but there are a lot more (even common characters with accent marks, like é, may have this issue).

'̈́क़ख़ग़ज़ड़ढ़फ़य़ড়ঢ়য়ਲ਼ਸ਼ਖ਼ਗ਼ਜ਼ਫ਼ଡ଼ଢ଼གྷཌྷདྷབྷཛྷཀྵཱཱིུྲྀླཱྀྀྒྷྜྷྡྷྦྷྫྷྐྵ⫝̸𤋮𢡊𢡄𣏕𥉉𥳐𧻓יִײַשׁשׂשּׁשּׂאַאָאּבּגּדּהּוּזּטּיּךּכּלּמּנּסּףּפּצּקּרּשּתּוֹבֿכֿפֿ𝅗𝅥𝅘𝅥𝅘𝅥𝅮𝅘𝅥𝅯𝅘𝅥𝅰𝅘𝅥𝅱𝅘𝅥𝅲𝆹𝅥𝆺𝅥𝆹𝅥𝅮𝆺𝅥𝅮𝆹𝅥𝅯𝆺𝅥𝅯 क़ ख़ ग़ ज़ ड़ ढ़ फ़ य़ ড় ঢ় য় ਲ਼ ਸ਼ ਖ਼ ਗ਼ ਜ਼ ਫ਼ ଡ଼ ଢ଼ གྷ ཌྷ དྷ བྷ ཛྷ ཀྵ ཱི ཱུ ྲྀ ླྀ ཱྀ ྒྷ ྜྷ ྡྷ ྦྷ ྫྷ ྐྵ ⫝̸ 𤋮 𢡊 𢡄 𣏕 𥉉 𥳐 𧻓 יִ ײַ שׁ שׂ שּׁ שּׂ אַ אָ אּ בּ גּ דּ הּ וּ זּ טּ יּ ךּ כּ לּ מּ נּ סּ ףּ פּ צּ קּ רּ שּ תּ וֹ בֿ כֿ פֿ 𝅗𝅥 𝅘𝅥 𝅘𝅥𝅮 𝅘𝅥𝅯 𝅘𝅥𝅰 𝅘𝅥𝅱 𝅘𝅥𝅲 𝆹𝅥 𝆺𝅥 𝆹𝅥𝅮 𝆺𝅥𝅮 𝆹𝅥𝅯 𝆺𝅥𝅯 

In addition to fixing this issue by simply having this.len reflect the normalized string's length, I've also written two tests in the areas that I've found this to be an issue, namely block comments and string literals. If they're misplaced / unnecessary / not enough I can certainly change them!

@blerner
Copy link
Member

blerner commented Nov 7, 2021

Thanks! As you saw in the comment on the line you changed -- the length property here isn't unicode-aware, for sure. I'm pretty sure I used the original length of the string because the lexer iterates character-by-character, and needs to supply source locations to tokens in such a way that sourcestring.substring(start, end) correctly extracts the entirety of the token, and at least at the time I was writing the lexer, the normalized length was wrong.

This is a particularly fiddly property to get right (see https://hsivonen.fi/string-length/, for an amusing example) and I mostly just punted on this when originally writing the lexer. Pyret gets this example weird, since CodeMirror doesn't handle the characters consistently with how they're output, either:
image
I genuinely don't know what the best thing to do here is.

@brandon-gong
Copy link
Author

Thanks for the article, I thought it was an interesting read!

I really don't understand the consequences of normalizing enough to make any serious argument for one way or another, but from my newbie point of view, the current behavior (of these characters causing errors in code) can cause unnecessary confusion especially for students less comfortable with Pyret because they're far more likely to believe they made a mistake in their code somewhere rather than some magic character sitting in a string or comment causing issues. (I'm speaking from personal experience 😅).

Also, I'm not sure how/why it changed since you wrote the lexer, but it looks like using the original length causes substring to not extract the whole token now, as that normalized-length-2 character causes Pyret to not parse string-length all the way? This screenshot is from code.pyret.org this morning:
image

Separately, I'm also curious -- what happens if we don't deal with normalizing the string at all? Everything still passes with make test on my computer, but I'm not sure if problems come up with CodeMirror or the browser doing anything to text input. This has the added benefit of allowing for string-length("𢡊") == 1, which is less of a "lie" than quietly normalizing it and saying string-length("𢡊") == 2. And maybe we avoid some Unicode quirks as well.

Anyway, I'm definitely out of my area of expertise here. Thanks so much for taking the time to review my pull request!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants