Grapheme clusters and emoji sequences #117

Open
ghost opened this issue May 29, 2021 · 7 comments
Labels
NYI/NewFeat Not yet implemented or New Feature question Question / Mis-usage

Comments

@ghost

ghost commented May 29, 2021

ble version: 0.4.0-devel3+301d40f
Bash version: 5.1.8(1)-release (x86_64-pc-linux-gnu)
Emoji font: ttf-twemoji 13.0.1-1

This issue is a bit different depending on the terminal and font patches, but I'll try to explain it; unfortunately I couldn't get my recordings attached on this post 🙁

After an emoji is typed (or pasted), typing more characters (or backspacing) causes the first character of the current word to be reprinted and the typed character to be swapped with the previous one. For instance, if one types hello world 🎃, moves the cursor to the r and types a, the w is reprinted and the line shows hello wwoarld 🎃. Typing at the beginning of a word treats the previous space-separated field as its word: if one types , before the w in hello world 🎃 (←3 spaces between hello and world), it shows hhello , world 🎃.

The reprinted character is not editable; it is only drawn on the screen. As said before, typing or backspacing while the emoji is present keeps causing the behavior, and once the emoji is deleted the issue stops. If the statement with the emoji is executed and is now in ble's history, the issue occurs again when the autocompletion for that statement shows up.

Also, as soon as another word is detected after the word with the emoji, the issue disappears. So if one types 🎃 bye, the issue disappears after typing the b, and one sees 🎃🎃🎃 bye and can keep typing normally. A single quote after the emoji (🎃') will not make a following space count as the start of a new word. A double quote after the emoji (🎃") makes the problem stop until the closing double quote is typed. If the emoji is preceded by an opening double quote, however, moving the cursor also causes the reprinting: if one types echo "🎃 and then moves the cursor backwards, one sees echo """"""""🎃, where the echo "" is just a printing and the actual characters were overwritten with """"".

Now, this is the behavior with most emojis, but with some like ♠️♥️♦️♣️♟️, the character typed gets moved forward and the previous rest of the word gets printed behind (and the cursor gets moved too), so from hello world ♟️, typing a after the r would result in helloworlad ♟️, and so on. This doesn't happen with their ♠♥♦♣♟ counterparts.

Finally, flag emojis (e.g. 🇧🇬) don't have this problem in most terminals I tested, which is interesting since the 2 emojis that compose a flag emoji (e.g. 🇧 🇬) do have the problem individually

I'll leave the following outputs from cat -A <<< "[EMOJI]" and bat -A <<< "[EMOJI]"
🎃 M-pM-^_M-^NM-^C$ \u{1f383}␊
🙄 M-pM-^_M-^YM-^D$ \u{1f644}␊
😱 M-pM-^_M-^XM-1$ \u{1f631}␊
👻 M-pM-^_M-^QM-;$ \u{1f47b}␊
♠️ M-bM-^YM- M-oM-8M-^O$ \u{2660}\u{fe0f}␊
♥️ M-bM-^YM-%M-oM-8M-^O$ \u{2665}\u{fe0f}␊
♦️ M-bM-^YM-&M-oM-8M-^O$ \u{2666}\u{fe0f}␊
♣️ M-bM-^YM-#M-oM-8M-^O$ \u{2663}\u{fe0f}␊
♟️ M-bM-^YM-^_M-oM-8M-^O$ \u{265f}\u{fe0f}␊
♠ M-bM-^YM- $ \u{2660}␊
♥ M-bM-^YM-%$ \u{2665}␊
♦ M-bM-^YM-&$ \u{2666}␊
♣ M-bM-^YM-#$ \u{2663}␊
♟ M-bM-^YM-^_$ \u{265f}␊
🇧🇬 M-pM-^_M-^GM-'M-pM-^_M-^GM-,$ \u{1f1e7}\u{1f1ec}␊
🇧 M-pM-^_M-^GM-'$ \u{1f1e7}␊
🇬 M-pM-^_M-^GM-,$ \u{1f1ec}␊
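
The listing above shows three kinds of input: a single code point (🎃), a base character followed by U+FE0F (♠️), and a pair of regional indicators (🇧🇬). A minimal Bash sketch of that classification follows. The helper name classify_cp and its labels are made up for illustration and are not part of ble.sh; the code points are copied from the bat -A column.

```bash
# classify_cp HEX: hypothetical helper (not part of ble.sh) that labels a
# code point, given in hex, by the role it plays in the sequences above.
classify_cp() {
  local cp=$(( 16#$1 ))
  if (( cp == 0xFE0F )); then
    echo "variation-selector-16"   # requests emoji presentation of the base
  elif (( cp >= 0x1F1E6 && cp <= 0x1F1FF )); then
    echo "regional-indicator"      # two of these in a row form a flag
  else
    echo "base"
  fi
}

# One entry per pasted emoji: 🎃, ♠️, ♠, 🇧🇬
for seq in "1f383" "2660 fe0f" "2660" "1f1e7 1f1ec"; do
  for hex in $seq; do              # unquoted on purpose: split on spaces
    printf 'U+%04X  %s\n' "$(( 16#$hex ))" "$(classify_cp "$hex")"
  done
  echo "--"
done
```

Only the entries with more than one line between the separators are multi-code-point sequences, which are the ones that behave differently from 🎃 in this report.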

@akinomyoga
Owner

This issue is a bit different depending on the terminal and font patches,

Yes. It depends on the terminal and its setting. Also, ble.sh doesn't support grapheme clusters.

I have several questions:

  • Q1: When you input the emoji in ble.sh, is the character shape correctly printed on the terminal?
  • Q2: Does the emoji occupy two cells of the terminal?
  • Q3: What is your terminal?
  • Q4: What is the output of the following commands?
$ bleopt emoji_@ char_width_mode
$ declare -p _ble_util_c2w_auto_width
$ ble/util/s2chars 🎃
$ echo "${ret[*]}"
$ for c in "${ret[@]}"; do ble/util/c2w "$c"; echo "w=$ret"; done

♠️ M-bM-^YM- M-oM-8M-^O$ \u{2660}\u{fe0f}␊
♥️ M-bM-^YM-%M-oM-8M-^O$ \u{2665}\u{fe0f}␊
♦️ M-bM-^YM-&M-oM-8M-^O$ \u{2666}\u{fe0f}␊
♣️ M-bM-^YM-#M-oM-8M-^O$ \u{2663}\u{fe0f}␊
♟️ M-bM-^YM-^_M-oM-8M-^O$ \u{265f}\u{fe0f}␊
🇧🇬 M-pM-^_M-^GM-'M-pM-^_M-^GM-,$ \u{1f1e7}\u{1f1ec}␊

ble.sh currently doesn't support these grapheme clusters and emoji sequences because it's technically involved. Even if 🇧🇬 seemed to work, I think it still causes problems with line wrap. Maybe I'll try to support variation selectors by setting their width to 0, but I'm not sure for now whether that would cause other problems.
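
The width-0 idea can be sketched roughly as below. This is only an illustration under assumptions: cp_width and seq_width are hypothetical helpers, not ble.sh's ble/util/c2w, and the emoji range used here is a coarse placeholder rather than a real Unicode width table.

```bash
# cp_width CP [EMOJI_WIDTH]: hypothetical helper, not ble.sh's ble/util/c2w.
# CP is a decimal code point; the result is the assumed number of cells.
cp_width() {
  local cp=$1 emoji_width=${2:-2}
  if (( cp >= 0xFE00 && cp <= 0xFE0F )); then
    echo 0                  # variation selectors take no cell of their own
  elif (( cp >= 0x1F300 && cp <= 0x1FAFF )); then
    echo "$emoji_width"     # coarse emoji range, an assumption of this sketch
  else
    echo 1
  fi
}

# seq_width CP...: total width of a sequence of decimal code points.
seq_width() {
  local total=0 cp
  for cp in "$@"; do
    total=$(( total + $(cp_width "$cp") ))
  done
  echo "$total"
}

seq_width 9824 65039   # ♠️ = U+2660 U+FE0F -> 1
seq_width 127875       # 🎃 = U+1F383       -> 2
```

With this rule ♠️ is laid out as one cell, which would match a terminal that keeps the sequence at the width of its base character; a terminal that widens the sequence to two cells would still disagree.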

@ghost
Author

ghost commented May 30, 2021

  • Q3: What is your terminal?

I'm testing in Gnome-terminal, Konsole, Terminator, Alacritty and Kitty

  • Q1: When you input the emoji in ble.sh, is the character shape correctly printed on the terminal?

I assume you're not exactly asking if the font looks as an emoji, but I have the CBDT/CBLC ttf-twemoji font which renders most emojis in terminals correctly out of the box (only Kitty prints flag emojis correctly, but maybe it's my configuration) with ble detached. Now, when ble is attached and the emoji is pasted, in Konsole, Kitty and Alacritty the character shape is correctly printed after pasting and after typing. In Gnome-Terminal and Terminator, the shape is also printed correctly if it's not preceded by a single quote ' or double quote "; otherwise the character shape is correctly printed after pasting but disappears after typing (so in echo "🙄", the emoji disappears when the closing " is typed). When a statement like echo 🙄 is executed, the emoji shape is always correctly printed in stdout.

The character shapes of the ♠️♥️♦️♣️♟️ emojis are also correctly printed in Konsole, Kitty and Alacritty, but in Gnome-Terminal and Terminator they don't appear after pasting or after typing if preceded by ' or ". Flag emojis, either composed 🇧🇬 or separated 🇧 🇬, are always correctly shaped in all terminals (but as said, only font-rendered correctly in Kitty).

  • Q2: Does the emoji occupy two cells of the terminal?

In Konsole, Alacritty and Kitty yes; in Gnome-Terminal and Terminator no.

  • Q4: What is the output of the following commands?
$ bleopt emoji_@ char_width_mode
bleopt emoji_version=13.1
bleopt emoji_width=1
bleopt char_width_mode=auto
$ declare -p _ble_util_c2w_auto_width
declare -- _ble_util_c2w_auto_width="1"
$ ble/util/s2chars 🎃
$ echo "${ret[*]}"
127875
$ for c in "${ret[@]}"; do ble/util/c2w "$c"; echo "w=$ret"; done
w=1

Same output in all terminals. With ble/util/s2chars 🇧🇬 there are of course 2 characters, so the other commands output 127463 127468 and w=1 w=1.
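
For reference, the decimal values printed here are the same code points that bat -A shows in hexadecimal. A minimal sketch of the conversion using only printf (to_uplus is a hypothetical helper name, not a ble.sh function):

```bash
# to_uplus DEC: hypothetical helper, prints a decimal code point as U+XXXX.
to_uplus() {
  printf 'U+%04X\n' "$1"
}

for cp in 127875 127463 127468; do
  to_uplus "$cp"
done
# prints U+1F383, U+1F1E7 and U+1F1EC, one per line
```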

ble.sh currently doesn't support these grapheme clusters and emoji sequences because it's technically involved. Even if 🇧🇬 seemed to work, I think it still causes problems with line wrap.

Line wrapping problems with flag emojis only occur in Konsole, but not in the other terminals because Konsole is the only terminal with the reprinting problem with flag emojis. Btw, line wrapping problems also don't occur with ♠♥♦♣♟, they only appear along with the reprinting problem.

@akinomyoga
Owner

akinomyoga commented May 30, 2021

OK! Thank you for your answers!

  • Q2: Does the emoji occupy two cells of the terminal?

In Konsole, Alacritty and Kitty yes; in Gnome-Terminal and Terminator no.

Does it mean an emoji occupies one cell in GNOME Terminal and Terminator? If so, you need to set bleopt emoji_width=2 in Konsole, Alacritty and Kitty, and bleopt emoji_width=1 in GNOME Terminal and Terminator.

Edit: I've tried GNOME Terminal and Terminator, but they also behave as bleopt emoji_width=2. I think we should always use emoji_width=2 for terminals with emoji support.

  • Q1: When you input the emoji in ble.sh, is the character shape correctly printed on the terminal?

I assume you're not exactly asking if the font looks as an emoji,

Ah, yes. I actually wanted to confirm that ble.sh receives the emoji correctly. If ble.sh fails to decode emoji in the user input, it will insert different characters in the command line string and print the different characters to the terminal. From your description and the output of the commands you provided, I think ble.sh correctly receives the emoji characters. So the problem is solely in the cursor position calculation of the output phase.

  • Q4: What is the output of the following commands?
$ bleopt emoji_@ char_width_mode
bleopt emoji_version=13.1
bleopt emoji_width=1
bleopt char_width_mode=auto

The outputs are as expected except for emoji_width. As I have mentioned above, you need to set emoji_width to the value corresponding to the terminal behavior.

Optionally, you may set emoji_version=13.0 since you seem to use ttf-twemoji 13.0.1-1, which is a font based on "Unicode 13.0 Emoji". Or, you may update the font to 13.1. It seems twemoji 13.1 was released just two days ago. Of course, the terminals you use also need to support 13.0 or 13.1.

ble.sh currently doesn't support these grapheme clusters and emoji sequences because it's technically involved. Even if 🇧🇬 seemed to work, I think it still causes problems with line wrap.

Line wrapping problems with flag emojis only occur in Konsole, but not in the other terminals because Konsole is the only terminal with the reprinting problem with flag emojis. Btw, line wrapping problems also don't occur with ♠♥♦♣♟, they only appear along with the reprinting problem.

Hmm, OK. I think it is also related to the terminal behavior. ble.sh doesn't recognize any grapheme clusters composed of multiple Unicode code points, so when the flag emoji is printed at the last column of the terminal, the two constituent code points XY may be placed differently in the internal ble.sh logic and in the actual terminal.

(A) Assumption in ble.sh (which treats X and Y as independent characters)
+--------------------+
|                   X|
|Y                   |
+--------------------+

(B) Actual terminal that treats XY as a grapheme cluster in the layout phase
+--------------------+
|                    |
|XY                  |
+--------------------+

But some terminals may behave as (A) in the layout phase and only resolve emojis in the rendering phase. In that case, the problem doesn't occur since the behavior matches with ble.sh's assumption.
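
The difference between (A) and (B) can be sketched with a small cursor simulation. This is only an illustration under stated assumptions: place_cells is a hypothetical helper, not ble.sh's layout code, and the 20-column screen and the widths are taken from the diagrams above.

```bash
# place_cells COLS START_X WIDTH...
# Hypothetical helper: prints "row col" for each item, moving an item to the
# start of the next row when it does not fit in the remaining columns.
place_cells() {
  local cols=$1 x=$2 y=0 w
  shift 2
  for w in "$@"; do
    if (( x + w > cols )); then
      x=0
      y=$(( y + 1 ))
    fi
    echo "$y $x"
    x=$(( x + w ))
  done
}

# (A) X and Y as two independent 1-cell characters, starting at the last column
place_cells 20 19 1 1   # X at "0 19", Y at "1 0"

# (B) XY as one 2-cell cluster starting at the last column
place_cells 20 19 2     # XY at "1 0"
```

When the editor computes positions as in (A) and the terminal lays out as in (B), every later cursor movement is off by the one cell left empty at the end of the first row.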

@ghost
Author

ghost commented May 30, 2021

  • Q2: Does the emoji occupy two cells of the terminal?

In Konsole, Alacritty and Kitty yes; in Gnome-Terminal and Terminator no.

I rechecked, and actually most emojis like 🎃 are 2 cells long in all terminals, I was trying with the ♠️♥️♦️♣️♟️ 🇧 🇬 emojis and those are the ones that are 2 cells long in Konsole, Alacritty and Kitty and 1 cell long in Gnome-Terminal and Terminator.

If so, you need to set bleopt emoji_width=2 in Konsole, Alacritty and Kitty, and bleopt emoji_width=1 in GNOME Terminal and Terminator.

Yeah!! That solved the reprinting problem with most emojis, thanks!! I forgot about it in blerc. The issues that still persist are the reprinting and line wrapping of grapheme clusters ♠️♥️♦️♣️♟️ (and 🇧 🇬 🇧🇬 in Konsole), and the disappearance of emojis after quotes with Gnome-Terminal and Terminator.

Optionally, you may set emoji_version=13.0 since you seem to use ttf-twemoji 13.0.1-1, which is a font based on "Unicode 13.0 Emoji". Or, you may update the font to 13.1. It seems twemoji 13.1 was released just two days ago. Of course, the terminals you use also need to support 13.0 or 13.1.

Oh thanks for the suggestion, but changing the version didn't seem to solve anything itself. I'll keep it up to date in any case.

Hmm, OK. I think it is also related to the terminal behavior. ble.sh doesn't recognize any grapheme clusters composed of multiple Unicode code points, so when the flag emoji is printed at the last column of the terminal, the two constituent code points XY may be placed differently in the internal ble.sh logic and in the actual terminal.

(A) Assumption in ble.sh (which treats X and Y as independent characters)
+--------------------+
|                   X|
|Y                   |
+--------------------+

(B) Actual terminal that treats XY as a grapheme cluster in the layout phase
+--------------------+
|                    |
|XY                  |
+--------------------+

Oh, maybe I was referring to line wrapping of the autocompletion. When an emoji has the reprinting problem and the autosuggestion exceeds the last column, it reprints it below the current line and messes up the cursor position as well, something like

 +--------------------+
 |🎃aaaaaaaaaaaaaaaaaa|
 |a  ▮                |
 |a                   |
 +--------------------+

As bleopt emoji_width=2 solves the reprinting problem for most emojis, it doesn't happen anymore for those, just for the grapheme clusters. As for the example you mentioned, it is indeed what happens for most terminals, just Kitty does the line wrapping like this:

+--------------------+
|                    |
|X                   |
|Y                   |
+--------------------+

Thanks again

@akinomyoga
Owner

I rechecked, and actually most emojis like 🎃 are 2 cells long in all terminals, I was trying with the ♠️♥️♦️♣️♟️ 🇧 🇬 emojis and those are the ones that are 2 cells long in Konsole, Alacritty and Kitty and 1 cell long in Gnome-Terminal and Terminator.

Yeah, the treatment of grapheme clusters and their components is the area where the behavior of terminals and applications differs the most. The different levels of conformance to the Unicode standard come from the technical difficulty of implementing the full Unicode specification.

The issues that still persist are the reprinting and line wrapping of grapheme clusters ♠️♥️♦️♣️♟️ (and 🇧 🇬 🇧🇬 in Konsole), and the disappearance of emojis after quotes with Gnome-Terminal and Terminator.

Well, they are all related to the grapheme clusters that ble.sh doesn't support.

Oh, maybe I was referring to line wrapping of the autocompletion. When an emoji has the reprinting problem and the autosuggestion exceeds the last column, it reprints it below the current line and messes up the cursor position as well, something like

 +--------------------+
 |🎃aaaaaaaaaaaaaaaaaa|
 |a  ▮                |
 |a                   |
 +--------------------+

As bleopt emoji_width=2 solves the reprinting problem for most emojis, it doesn't happen anymore for those, just for the grapheme clusters. As for the example you mentioned, it is indeed what happens for most terminals, just Kitty does the line wrapping like this:

+--------------------+
|                    |
|X                   |
|Y                   |
+--------------------+

Hmm, I think that is kitty's glitch. Maybe I can support grapheme clusters someday, but I will never support kitty's behavior...

I also checked the behavior of other shells' line editors. It seems that readline recognizes the grapheme clusters and works well in GNOME Terminal (but not in kitty). Zsh avoids handling the grapheme clusters directly but instead shows an ASCII representation of variation selector as <fe0f>. Fish 2.7.1 doesn't work at all in my environment both in kitty and GNOME terminal. I also tried set fish_emoji_width 2 but it didn't change anything. I found this issue fish-shell/fish-shell#5583, so it's just because my fish (in Ubuntu 18 LTS) is too old.

akinomyoga added the NYI/NewFeat (Not yet implemented or New Feature) and question (Question / Mis-usage) labels May 30, 2021
akinomyoga changed the title from "Emojis mess up autocompletion" to "Grapheme clusters and emoji sequences" Jun 1, 2021
@ghost
Author

ghost commented Jun 3, 2021

The issues that still persist are the reprinting and line wrapping of grapheme clusters ♠️♥️♦️♣️♟️ (and 🇧 🇬 🇧🇬 in Konsole), and the disappearance of emojis after quotes with Gnome-Terminal and Terminator.

Well, they are all related to the grapheme clusters that ble.sh doesn't support.

Even all emojis inside quotes disappearing? I also found that when that happens, if an autosuggestion appears inside those quotes, the emoji reappears, but well, if it's as you say, there's not much to do.

I also checked the behavior of other shells' line editors. It seems that readline recognizes the grapheme clusters and works well in GNOME Terminal (but not in kitty). Zsh avoids handling the grapheme clusters directly but instead shows an ASCII representation of variation selector as <fe0f>. Fish 2.7.1 doesn't work at all in my environment both in kitty and GNOME terminal. I also tried set fish_emoji_width 2 but it didn't change anything. I found this issue fish-shell/fish-shell#5583, so it's just because my fish (in Ubuntu 18 LTS) is too old.

I did notice that <fe0f> in zsh inside Konsole, Alacritty and Terminator but not in my GNOME Terminal, there zsh actually renders it correctly but the following character is a bit buggy. In fish I noticed grapheme clusters seem to mess fish's fish_right_prompt function. It seems that in those shells Konsole is the most buggy terminal, it causes reprinting issues in zsh similar to what I described previously, and reprints new lines of the prompt in fish. If I remember correctly, Konsole got emoji support not so long ago, and font rendering is not perfect, it doesn't show my twemoji font, instead prints some other font. Just something to keep in mind if support ever comes in ble.

@akinomyoga
Owner

The issues that still persist are the reprinting and line wrapping of grapheme clusters ♠️♥️♦️♣️♟️ (and 🇧 🇬 🇧🇬 in Konsole), and the disappearance of emojis after quotes with Gnome-Terminal and Terminator.

Well, they are all related to the grapheme clusters that ble.sh doesn't support.

Even all emojis inside quotes disappearing? I also found that when that happens, if an autosuggestion appears inside those quotes, the emoji reappears, but well, if it's as you say, there's not much to do.

OK. Actually, I cannot reproduce this behavior in my GNOME Terminal. What is the version of your GNOME Terminal? Maybe I'll also try Terminator later.

I did notice that <fe0f> in zsh inside Konsole, Alacritty and Terminator but not in my GNOME Terminal, there zsh actually renders it correctly but the following character is a bit buggy.

Hm, OK. Zsh is clever enough to switch its behavior depending on the terminal. My naive guess is that Konsole, Alacritty and Terminator implement their own width determination for emoji characters and sequences, but GNOME Terminal uses the system wcwidth, the same as zsh.

In fish I noticed grapheme clusters seem to mess fish's fish_right_prompt function. It seems that in those shells Konsole is the most buggy terminal, it causes reprinting issues in zsh similar to what I described previously, and reprints new lines of the prompt in fish. If I remember correctly, Konsole got emoji support not so long ago, and font rendering is not perfect, it doesn't show my twemoji font, instead prints some other font.

OK, thanks for the information. Yeah, this is one of the messiest areas in terminals. I remember the discussion at Terminal WG #9.

Just something to keep in mind if support ever comes in ble.

I currently have two different approaches in mind. (a) One approach is to treat a cluster as one character in text editing. For example, pressing delete after a grapheme cluster deletes the entire cluster. (b) Another approach is to leave text editing unchanged and only change how clusters are laid out in terminals. In this case, pressing delete after e.g. ♥️ will just delete the variation selector and turn it into a plain ♥.

  • Although UAX #29 seems to recommend approach (a) for a uniform user experience, it largely affects the implementation in ble.sh because we need to change all the editing features. Also, proper grapheme cluster detection and counting adds a large overhead. For example, the Bash feature ${str:offset:length} can no longer be used if one wants to operate on Unicode grapheme clusters rather than Unicode code points, so one needs to count the characters in a Bash script one by one while detecting grapheme clusters. One possible workaround is to replace each grapheme cluster with a private character (a single Unicode code point) in a new "internal representation" of ble.sh. But we still need to detect grapheme clusters when inserting characters or strings. This approach requires drawing a clear boundary between the internal representation and the real command line strings inside ble.sh, and adding conversion code at every place where data are passed between the two representations.
  • The latter approach is much easier in my opinion because it only affects the layout algorithm, but other issues may turn up once I start to implement it. I'm not sure.
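
To give a feel for the one-by-one cluster detection that approach (a) would need, here is a deliberately simplified sketch. It is not ble.sh code and not a full UAX #29 implementation: group_clusters is a hypothetical helper that works on decimal code points (the form printed by ble/util/s2chars) and models only two rules, variation selectors and regional-indicator pairs.

```bash
# group_clusters CP...: hypothetical helper, a simplified stand-in for UAX #29.
# Takes decimal code points and prints one cluster per line. Modeled rules:
#   1. a variation selector (U+FE00..U+FE0F) attaches to the preceding code point
#   2. two consecutive regional indicators (U+1F1E6..U+1F1FF) form one flag
group_clusters() {
  local cp cluster="" ri_count=0
  for cp in "$@"; do
    if [[ -n $cluster ]] && (( cp >= 0xFE00 && cp <= 0xFE0F )); then
      cluster+=" $cp"
      ri_count=0
    elif (( ri_count == 1 && cp >= 0x1F1E6 && cp <= 0x1F1FF )); then
      cluster+=" $cp"
      ri_count=2
    else
      # Start a new cluster, flushing the previous one if there is any.
      if [[ -n $cluster ]]; then echo "$cluster"; fi
      cluster=$cp
      if (( cp >= 0x1F1E6 && cp <= 0x1F1FF )); then ri_count=1; else ri_count=0; fi
    fi
  done
  if [[ -n $cluster ]]; then echo "$cluster"; fi
}

# ♥️ a 🇧🇬  =  U+2665 U+FE0F, U+0061, U+1F1E7 U+1F1EC
group_clusters 9829 65039 97 127463 127468
```

Here five code points collapse into three clusters. Under approach (a), delete would remove a whole output line at once; under approach (b), editing would keep working on the individual numbers and only the layout step would consult the grouping.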

Also, I need to support grapheme clusters and emoji sequences in prompts separately. The layout of prompts is handled by separate logic because they are static texts, unlike the command line strings.
