Consider string normalisation #18

Open
sidju opened this issue Oct 13, 2023 · 3 comments

Comments

@sidju
Owner

sidju commented Oct 13, 2023

Different Unicode representations of the same visible string should match in regex operations, so it would probably be good to normalise both buffer contents and commands.

Details can be found here: https://tonsky.me/blog/unicode/
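
To make the problem concrete, here is a minimal sketch using only the Rust standard library. The helper names `same_bytes` and `codepoints` are made up for illustration and are not part of this crate. It shows that "é" written as one precomposed codepoint and "é" written as "e" plus a combining accent look the same but do not compare equal.

```rust
/// Returns true only when the two strings are byte-for-byte identical,
/// which is what a plain string comparison checks.
/// (Hypothetical helper, for illustration only.)
fn same_bytes(a: &str, b: &str) -> bool {
    a == b
}

/// Lists the codepoints of a string as `U+XXXX` so the difference between
/// two visually identical strings can be inspected.
/// (Hypothetical helper, for illustration only.)
fn codepoints(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}
```

With these, `same_bytes("\u{00E9}", "e\u{0301}")` is false even though both render as "é": the first is the single codepoint U+00E9 and the second is U+0065 followed by U+0301.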

@sidju
Owner Author

sidju commented Oct 13, 2023

Probably the best way to implement this is to:

  • Make the Line and PubLine constructors do string normalisation (so the buffer and clipboard are always normalised, regardless of which IO implementation or UI provides the data and in which format)
  • Normalise all command strings at the start of crate::cmd::run()
  • Optionally normalise input in the UI functions, although this is not actually needed since input is always one of:
    • Input to the buffer, which will be normalised by the Line constructor
    • Input to 'g'/'v'/'G'/'V', which will be normalised when they are fed as macro input to the editor later.
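
A rough sketch of the first two points, under heavy assumptions: `SketchLine` and `run_sketch` are hypothetical stand-ins, not the crate's actual `Line` or `crate::cmd::run()`, and `toy_nfc` only composes a handful of letters with a combining acute accent. Real code would call a full NFC implementation (for example the `nfc()` iterator from the `unicode-normalization` crate) in its place.

```rust
/// Stand-in for a real NFC implementation: composes only a few
/// base letter + combining acute accent (U+0301) pairs.
/// A real normaliser covers all of Unicode; this table does not.
fn toy_nfc(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        if c == '\u{0301}' {
            // Try to merge the accent into the previously emitted character.
            let composed = match out.chars().last() {
                Some('a') => Some('\u{00E1}'),
                Some('e') => Some('\u{00E9}'),
                Some('o') => Some('\u{00F3}'),
                _ => None,
            };
            if let Some(precomposed) = composed {
                out.pop();
                out.push(precomposed);
                continue;
            }
        }
        out.push(c);
    }
    out
}

/// Hypothetical line type; not the crate's actual `Line`.
#[derive(Debug, PartialEq)]
struct SketchLine {
    text: String,
}

impl SketchLine {
    /// Every line passes through the normaliser on construction, so the
    /// buffer never holds non-normalised text.
    fn new(text: &str) -> Self {
        SketchLine { text: toy_nfc(text) }
    }
}

/// Hypothetical command entry point: normalises the command string first,
/// then (for illustration only) treats it as a literal search pattern and
/// returns the indices of matching lines.
fn run_sketch(command: &str, buffer: &[SketchLine]) -> Vec<usize> {
    let command = toy_nfc(command);
    buffer
        .iter()
        .enumerate()
        .filter(|(_, line)| line.text.contains(command.as_str()))
        .map(|(index, _)| index)
        .collect()
}
```

With both sides normalised, a search for "café" finds the line whether the user typed the precomposed or the decomposed form, and whichever form the line was read in.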

@sidju
Owner Author

sidju commented Oct 16, 2023

NFC string normalisation is probably best suited, as composing codepoints where possible would minimise file size and reduce the complexity of width calculations in UIs. (Even when you do the right thing and use unicode-width, it does a lookup per codepoint, so reducing the number of codepoints is an improvement. And if you do the wrong thing and assume every codepoint is one character wide, you are wrong by less with NFC normalisation.)
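
The size argument can be checked with the standard library alone. This is only an illustration on a single character; the constant and function names are made up for the example.

```rust
/// "é" as a base letter plus combining acute accent (the decomposed form).
const DECOMPOSED: &str = "e\u{0301}";
/// "é" as a single precomposed codepoint (what NFC produces).
const COMPOSED: &str = "\u{00E9}";

/// Number of codepoints, i.e. how many entries a per-codepoint width
/// lookup has to process.
fn codepoint_count(s: &str) -> usize {
    s.chars().count()
}

/// Size in bytes when stored as UTF-8, i.e. what ends up in the file.
fn utf8_bytes(s: &str) -> usize {
    s.len()
}
```

Here `DECOMPOSED` is 2 codepoints and 3 bytes, while `COMPOSED` is 1 codepoint and 2 bytes. A UI that naively counts codepoints would report a width of 2 for the decomposed form and the correct width of 1 for the composed one.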

@sidju
Owner Author

sidju commented Oct 25, 2023

As normalisation could cause issues, it should be possible to disable it without recompilation, as a workaround. This requires passing some kind of flag to every code location where normalisation would occur, so that either all or no text is normalised.

This becomes very tricky if we do normalisation in Line and PubLine, as those public APIs would make it easy for a library user to accidentally introduce normalised data when the editor is configured not to normalise, or non-normalised data when it is. Not a nice solution, as it offers such a footgun.
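
One possible shape for such a flag, purely as a sketch: `Normalisation`, `FlaggedLine` and `placeholder_nfc` are hypothetical names, and `placeholder_nfc` handles a single character pair where a real NFC implementation would go. Requiring the mode as a constructor argument makes the choice visible at every call site, but it does not remove the footgun: a library user can still pass a mode that differs from the editor's configuration.

```rust
/// Runtime switch for normalisation, so it can be turned off without
/// recompiling. (Hypothetical type, not part of the crate.)
#[derive(Clone, Copy, Debug, PartialEq)]
enum Normalisation {
    Nfc,
    Off,
}

/// Placeholder for a real NFC implementation; it only composes
/// "e" + U+0301 into U+00E9.
fn placeholder_nfc(text: &str) -> String {
    text.replace("e\u{0301}", "\u{00E9}")
}

impl Normalisation {
    /// Normalises the text when enabled, otherwise returns it unchanged.
    fn apply(self, text: &str) -> String {
        match self {
            Normalisation::Nfc => placeholder_nfc(text),
            Normalisation::Off => text.to_owned(),
        }
    }
}

/// Hypothetical line type; not the crate's actual `Line` or `PubLine`.
#[derive(Debug, PartialEq)]
struct FlaggedLine {
    text: String,
}

impl FlaggedLine {
    /// The caller has to state which mode applies when building a line.
    fn new(text: &str, mode: Normalisation) -> Self {
        FlaggedLine { text: mode.apply(text) }
    }
}
```

With `Normalisation::Off` the text is stored exactly as given; with `Normalisation::Nfc` the decomposed "é" is stored in its composed form.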
