Consider string normalisation #18

sidju · 2023-10-13T09:36:26Z

Since different representations of the same visible string should match in regex operations, it would probably be good to normalise both buffer contents and commands.

Details are here: https://tonsky.me/blog/unicode/
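
To illustrate the mismatch, here is a minimal sketch in Python (used only because its standard library ships `unicodedata`; it is not the editor's own code). NFC is picked here purely as an example form, and the sample word is made up.

```python
import re
import unicodedata

composed = "caf\u00e9"      # 'é' as a single codepoint (U+00E9)
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent (U+0301)

# Same visible text, different codepoint sequences.
assert composed != decomposed

# A pattern typed in composed form does not match decomposed buffer text.
raw_match = re.compile(composed).search(decomposed)


def nfc(text: str) -> str:
    """Normalise to NFC (assumed form, for illustration only)."""
    return unicodedata.normalize("NFC", text)


# Once both the pattern and the buffer text are normalised, they match.
normalised_match = re.compile(nfc(composed)).search(nfc(decomposed))
```

Without normalisation `raw_match` is `None`; with both sides normalised the search succeeds.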

sidju · 2023-10-13T13:14:28Z

Probably the best way to implement this is to:

1. Make the Line and PubLine constructors do string normalisation (so the buffer and clipboard are always normalised, no matter which IO implementation or UI provides the data, and in which format).
2. Normalise all command strings at the start of crate::cmd::run().
3. Optionally normalise UI input in the functions that receive it. This is not actually needed, since input is always one of:
   - Input to the buffer, which will be normalised by the Line constructor.
   - Input to 'g'/'v'/'G'/'V', which will be normalised when it is later fed to the editor as macro input.
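
The plan above can be sketched roughly as follows. This is Python for illustration only: `Line` and `run` here are hypothetical stand-ins for the real Line/PubLine constructors and crate::cmd::run(), the "command" is a toy literal search, and NFC is an assumed choice of form.

```python
import unicodedata


def normalise(text: str) -> str:
    """Assumed normalisation step (NFC chosen for illustration)."""
    return unicodedata.normalize("NFC", text)


class Line:
    """Hypothetical stand-in for the Line/PubLine constructors."""

    def __init__(self, text: str) -> None:
        # Everything entering the buffer or clipboard passes through here,
        # so stored text is always normalised regardless of its source.
        self.text = normalise(text)


def run(buffer: list, command: str) -> list:
    """Hypothetical stand-in for crate::cmd::run()."""
    # Normalise the command string once, at the entry point.
    command = normalise(command)
    # Toy command: return indices of lines containing the command text.
    return [i for i, line in enumerate(buffer) if command in line.text]


# Buffer text arrives decomposed, the command arrives composed.
buffer = [Line("cafe\u0301"), Line("tea")]
hits = run(buffer, "caf\u00e9")
```

With only these two normalisation points, the search finds the first line even though the input and the command used different representations.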

sidju · 2023-10-16T07:44:06Z

NFC string normalisation is probably best suited, as combining codepoints where possible would minimise file size and reduce the complexity of width calculations in UIs. (Even when you do the right thing and use unicode-width, it will do a lookup per codepoint, so reducing the number of codepoints is an improvement. And if you do the wrong thing and assume every codepoint is 1 character, you are wrong by less with NFC normalisation.)
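
A small sketch of the size argument, again in Python for illustration and with a made-up sample string: the same visible text has fewer codepoints and fewer UTF-8 bytes in NFC than in NFD.

```python
import unicodedata

text = "\u00c5re caf\u00e9"  # "Åre café", 8 visible characters

nfc = unicodedata.normalize("NFC", text)
nfd = unicodedata.normalize("NFD", text)

# Codepoint counts: this is also what a naive
# "one codepoint = one column" width estimate would report.
nfc_codepoints = len(nfc)
nfd_codepoints = len(nfd)

# Size on disk when encoded as UTF-8.
nfc_bytes = len(nfc.encode("utf-8"))
nfd_bytes = len(nfd.encode("utf-8"))

print(nfc_codepoints, nfd_codepoints)  # 8 10
print(nfc_bytes, nfd_bytes)            # 10 12
```

Here the naive width estimate is exactly right for the NFC form and off by two for the NFD form.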

sidju · 2023-10-25T07:03:10Z

As normalisation could cause issues, it should be possible to disable it without recompilation, as a workaround. This requires handing around some manner of flag to all code locations where normalisation would occur, so that either all or no text is normalised.

This becomes very tricky if we do normalisation in Line and PubLine, as those public APIs would make it easy for a library user to accidentally mix normalised and non-normalised data, for example by introducing normalised data when the editor is configured not to normalise. Not a nice solution, as it offers such a footgun.
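
A rough sketch of that footgun, in Python for illustration: `make_line` is a hypothetical constructor that takes the runtime flag explicitly, and NFC is an assumed form. If one caller passes a different flag value than the editor is configured with, the same input ends up stored in two different representations.

```python
import unicodedata


def make_line(text: str, normalise: bool) -> str:
    """Hypothetical line constructor that takes the runtime flag explicitly."""
    return unicodedata.normalize("NFC", text) if normalise else text


# The editor is configured with normalisation disabled...
editor_line = make_line("cafe\u0301", normalise=False)

# ...but a library caller constructs a line with it enabled.
library_line = make_line("cafe\u0301", normalise=True)

# The same input now exists in two representations in one session.
mixed_state = editor_line != library_line
```

Nothing in the constructor's signature prevents the two call sites from disagreeing, which is the "all or no text" guarantee breaking down.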
