C++ Identifiers using UAX 31

C++ Identifier Syntax using Unicode Standard Annex 31

  • That C++ identifiers match the pattern

    (XID_Start + _ ) + XID_Continue*.

  • That portable source is required to be normalized as NFC.
  • That using unassigned code points be ill-formed.
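As a rough illustration of the proposed pattern, the check can be sketched as below. The two property functions are hypothetical stand-ins that cover only ASCII and a few Latin-1 letters; a real lexer would look up XID_Start and XID_Continue in the Unicode Character Database. All function names here are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumption: tiny illustrative subset of XID_Start (not the real property).
inline bool is_xid_start_sketch(char32_t c) {
    return (c >= U'A' && c <= U'Z') || (c >= U'a' && c <= U'z')
        || (c >= 0x00C0 && c <= 0x00D6)   // Latin-1 letters before U+00D7
        || (c >= 0x00D8 && c <= 0x00F6)   // Latin-1 letters before U+00F7
        || (c >= 0x00F8 && c <= 0x00FF);  // remaining Latin-1 letters
}

// Assumption: tiny illustrative subset of XID_Continue:
// start characters, ASCII digits, and the underscore.
inline bool is_xid_continue_sketch(char32_t c) {
    return is_xid_start_sketch(c) || (c >= U'0' && c <= U'9') || c == U'_';
}

// The slide's pattern: (XID_Start + _) followed by XID_Continue*.
inline bool is_identifier_sketch(const std::u32string& s) {
    if (s.empty()) return false;
    if (!(is_xid_start_sketch(s[0]) || s[0] == U'_')) return false;
    for (std::size_t i = 1; i < s.size(); ++i) {
        if (!is_xid_continue_sketch(s[i])) return false;
    }
    return true;
}

// Small usage check for the sketch.
inline void identifier_sketch_examples() {
    assert(is_identifier_sketch(U"_tmp1"));
    assert(!is_identifier_sketch(U"1abc"));        // a digit cannot start
    assert(!is_identifier_sketch(U"t\u200Dmp"));   // ZWJ is not accepted
}
```

Note that the zero width joiner is rejected here simply because it is in neither set, which is the behaviour the proposal relies on.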

Problem this fixes : NL 029

The allowed characters include those from U+200b until U+206x. These are zero-width and control characters that lead to names that are impossible to type, names that are indistinguishable from one another, unusable code, and compile errors (such as those caused by accidentally included RTL modifiers).

Status Quo: we allow other “weird identifier code points”

  • The middle dot · which looks like an operator.
  • Many non-combining “modifiers” and accent marks, such as ´ and ¨ and ꓻ which don’t really make sense on their own.
  • “Tone marks” from various languages, including ˫ (similar to a box-drawing character ├ which is an operator).
  • The “Greek question mark” ;
  • Symbols which are simply not linguistic, such as ۞ and ༒.

https://gist.github.com/jtbandes/c0b0c072181dcd22c3147802025d0b59#weird-identifier-code-points

UAX 31 - Unicode Identifier and Pattern Syntax

  • Follows the same principles as originally used for C++
  • Actively maintained
  • Stable

XID_Start and XID_Continue

  • Unicode database defined properties
  • Closed under normalization for all four forms
  • Once a code point has the property it is never removed
  • Roughly:
    • Start == letters
    • Continue == Start + numbers + some punctuation

The Emoji Problem

  • The emoji-like code points that we knew about were excluded
  • We included all unassigned code points
  • Status Quo Emoji ‘support’ is an accident, incomplete, and broken

Status quo is broken

Some Status Quo examples

| Not Valid   | Valid       |
|-------------+-------------|
| int ⏰ = 0; | int 🕐 = 0; |
| int ☠️ = 0; | int 💀 = 0; |
| int ✋️ = 0; | int 👊 = 0; |
| int ✈️ = 0; | int 🚀 = 0; |
| int ☹️ = 0; | int 😀 = 0; |

Validity is controlled by when the character was added to Unicode
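The split can be reproduced with a range lookup. The sketch below uses an excerpt of the status quo allowed-range table, written from memory of the C++17 identifier character ranges, so treat the exact bounds as an assumption; the names `Range`, `status_quo_excerpt`, and `allowed_status_quo` are invented for the example. Older symbols sit in gaps between the allowed ranges, while everything from U+10000 to U+1FFFD is accepted wholesale.

```cpp
#include <cassert>

struct Range { char32_t first, last; };

// Assumption: excerpt of the status quo table, not the complete list.
constexpr Range status_quo_excerpt[] = {
    {0x2070, 0x218F}, {0x2460, 0x24FF}, {0x2776, 0x2793},
    {0x2C00, 0x2DFF}, {0x2E80, 0x2FFF},
    {0x10000, 0x1FFFD},  // whole supplementary plane 1, assigned or not
};

// True if c falls inside one of the excerpted ranges.
inline bool allowed_status_quo(char32_t c) {
    for (const Range& r : status_quo_excerpt) {
        if (c >= r.first && c <= r.last) return true;
    }
    return false;
}

// Small usage check mirroring two rows of the examples.
inline void status_quo_examples() {
    assert(!allowed_status_quo(0x23F0));   // ALARM CLOCK
    assert(allowed_status_quo(0x1F550));   // CLOCK FACE ONE OCLOCK
    assert(!allowed_status_quo(0x2620));   // SKULL AND CROSSBONES
    assert(allowed_status_quo(0x1F480));   // SKULL
}
```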

Status Quo: ♀ and ♂ are disallowed

Gendered variants of emoji are selected by using a zero-width joiner together with the male or female sign.

// Valid
    bool 👷 = true; //  Construction Worker
// Not valid
    bool 👷‍♀ = false; // Woman Construction Worker ({Construction Worker}{ZWJ}{Female Sign})

Problems adding Emoji as identifiers

Emoji are complex

  • Not just code points
  • Need grapheme cluster analysis
  • May incur costs even for code not using emoji

Emoji are not “Stable” in Unicode

From the emoji spec

isEmoji(♟)=false for Emoji Version 5.0, but true for Version 11.0.

It is possible that the emoji property could be removed.

Identifying Emoji is difficult

The Unicode Standard provides a regex that rejects anything that is not an emoji, but a match does not guarantee a valid emoji sequence.

\p{RI} \p{RI}
| \p{Emoji}
    ( \p{EMod}
    | \x{FE0F} \x{20E3}?
    | [\x{E0020}-\x{E007E}]+ \x{E007F} )?
    (\x{200D} \p{Emoji}
      ( \p{EMod}
      | \x{FE0F} \x{20E3}?
      | [\x{E0020}-\x{E007E}]+ \x{E007F} )?
    )*

It’s not clear how much of the Unicode database would be required for complete support.

UNICODE EMOJI

Some surprising things are emoji

002A          ; Emoji                # E0.0   [1] (*️)       asterisk
0030..0039    ; Emoji                # E0.0  [10] (0️..9️)    digit zero..digit nine
{DIGIT ONE}{VARIATION SELECTOR-16}{COMBINING ENCLOSING KEYCAP} 1️⃣

{ASTERISK}{VARIATION SELECTOR-16}{COMBINING ENCLOSING KEYCAP} *️⃣
/// would this be valid?
int 1️⃣ = 1;

Fixing the emoji problem would mean being inventive

Being inventive in an area outside our expertise is HARD

Adopting UAX31 as a base to move forward is conservative

UAX 31 is a known good state

Script Issues

Some scripts require characters that control display, or punctuation, that are not in the identifier set.

This includes English

  • Apostrophe and dash
    • won't
    • can't
    • mustn't
    • mother-in-law
  • Programmers are used to this and do not notice

Zero Width characters are excluded by UAX 31

Status quo allows these invisible characters

int tmp = 0;
int t‍‍mp = 0;
  • clang 10 warns

    <source>:2:6: warning: identifier contains Unicode character <U+200D> that is invisible in some environments [-Wunicode-zero-width]

int t<U+200D><U+200D>mp = 0;
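A check in the spirit of that warning can be sketched as follows. This is not clang's implementation; `zero_width_positions` is a hypothetical helper, and the set of code points it flags (U+200B through U+200D) is an assumption chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns the indices of ZERO WIDTH SPACE (U+200B), ZERO WIDTH NON-JOINER
// (U+200C), and ZERO WIDTH JOINER (U+200D) inside an identifier.
inline std::vector<std::size_t> zero_width_positions(const std::u32string& id) {
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < id.size(); ++i) {
        if (id[i] >= 0x200B && id[i] <= 0x200D) hits.push_back(i);
    }
    return hits;
}

// Small usage check: the two spellings of "tmp" from the slide.
inline void zero_width_examples() {
    assert(zero_width_positions(U"tmp").empty());
    assert(zero_width_positions(U"t\u200D\u200Dmp").size() == 2);
}
```

Under the UAX 31 pattern no such diagnostic is needed, because these code points are not XID_Continue and the second identifier is simply ill-formed.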

ZWJ and ZWNJ

However, the zero-width joiner and zero-width non-joiner are used in some scripts.

Farsi word “names”
نامهای
NOON + ALEF + MEEM + HEH + ALEF + FARSI YEH
https://www.unicode.org/reports/tr31/images/uax31-figure-2-farsi-ex1-v1-web.jpg
Farsi word “a letter”
نامه‌ای
NOON + ALEF + MEEM + HEH + ZWNJ + ALEF + FARSI YEH
https://www.unicode.org/reports/tr31/images/uax31-figure-2-farsi-ex2-v1-web.jpg

Anecdotally, these issues are understood and worked around

UAX 31 has an expensive solution

An identifier can be checked for which scripts its code points belong to, and the rules for allowed characters can then be tailored per script. This requires a Unicode database and extensive analysis during lexing.

SG 16 does not recommend this.
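To give a sense of what context-dependent tailoring involves, here is a deliberately crude sketch. It is not the UAX 31 rule: it treats the Arabic block (U+0600 to U+06FF) as a stand-in for a real script lookup and accepts a ZWNJ only when both neighbours fall in that block. Both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumption: block membership stands in for the Unicode Script property.
inline bool in_arabic_block(char32_t c) {
    return c >= 0x0600 && c <= 0x06FF;
}

// Accepts a ZWNJ (U+200C) only between two code points of the Arabic block.
inline bool zwnj_placement_ok(const std::u32string& id) {
    for (std::size_t i = 0; i < id.size(); ++i) {
        if (id[i] != 0x200C) continue;
        const bool prev_ok = i > 0 && in_arabic_block(id[i - 1]);
        const bool next_ok = i + 1 < id.size() && in_arabic_block(id[i + 1]);
        if (!prev_ok || !next_ok) return false;
    }
    return true;
}

// Small usage check.
inline void zwnj_examples() {
    // NOON ALEF MEEM HEH ZWNJ ALEF FARSI YEH
    assert(zwnj_placement_ok(U"\u0646\u0627\u0645\u0647\u200C\u0627\u06CC"));
    // ZWNJ hidden inside a Latin identifier
    assert(!zwnj_placement_ok(U"t\u200Cmp"));
}
```

Even this toy version needs per-code-point data and neighbour inspection for every identifier, which hints at the cost of doing it properly.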

Other adopters