Fix Tokenizer.prototype.tokenizeFrom string length after normalizing #1628
This pull request addresses #1627, in which I was getting strange bugs on a particular character.
Currently, `tokenizeFrom` normalizes the given source string to Unicode Normalization Form C, then stores the *original* string's length in a separate variable, `this.len`. However, calling `.normalize()` on a string can change its length, so it's necessary that `this.len` reflect the length of the newly normalized string to avoid lexing errors.

This issue turns out to be pretty prevalent, and I've found a slew of characters that cause the same error in Pyret right now. Below is a small sample of them that I found with a small script, but there are a lot more (even common characters with accent marks, like é, may have this issue).
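As a quick illustration (not code from this PR), é is one such character: in decomposed form it is two code units, which NFC composes into one, so the pre- and post-normalization lengths disagree:

```js
// "é" written as "e" + U+0301 (combining acute accent): two code units.
const decomposed = "e\u0301";
console.log(decomposed.length);                  // 2
// NFC composes the pair into the single code unit U+00E9 ("é").
console.log(decomposed.normalize("NFC").length); // 1
```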
In addition to fixing this issue by simply having `this.len` reflect the normalized string's length, I've also written two tests in the areas where I've found this to be an issue, namely block comments and string literals. If they're misplaced / unnecessary / not enough, I can certainly change them!