Skip to content

Latest commit

 

History

History
171 lines (126 loc) · 12.7 KB

D1859.md

File metadata and controls

171 lines (126 loc) · 12.7 KB
title document date audience author toc
Standard terminology character sets and encodings
D1859R1
today
SG16
EWG
CWG
name email
Steve Downey
false

Abstract: This document proposes new standard terms for the various encodings for character and string literals, and the encodings associated with some character types. It also proposes that the wording used for [lex.charset], [lex.ccon], [lex.string], and [basic.fundamental.8] be modified to reflect the new terminology. This paper does not intend to propose any changes that would require changes in any currently conforming implementation.

Introduction

In discussions around understanding the current capabilities of C++ and proposing new capabilities and facilities, SG16 has found that the current standard wording is often unclear, and does not match well the language currently used in 10646 and the Unicode Standard. This makes having technical discussions difficult. For example, the phrase "execution encoding" often comes up, or "presumed execution encoding", trying to describe the encodings of char literals and strings as interpreted by the character classification functions. This conflates several concepts, and is not actually standard terminology. It would be useful to have standard terminology that did cover these concepts.

Execution character set is a standard term, however it defines what abstract characters must be included in the character repertoire of the character set used to encode C++, specifically the various kinds of character literals. That character set is a strict superset of the source character set, which defines the abstract characters must be in the character repertoire of the character set used to write C++ source code. The encodings of those character sets are not specified, and in fact there may be several encodings used depending on the context or kind of literal.

There are five encodings that are associated with the five kinds of character literals, corresponding to char, wchar_t, char8_t, char16_t, and char32_t. For 8, 16, and 32, the encodings must be UTF-8, UTF-16, and UTF-32. There are no associated encodings for unsigned char or signed char.

The encoding used for narrow and wide character and string literals is implementation defined, and is, of course, fixed at translation time.

At runtime, however, interpretation of character data is usually controlled by locale, either explicitly, or via the locale specified by setlocale(). The dynamic locale may not be the same as the literal encoding used at translation time. This is a source of errors in text processing.

Another source of problems is the baked in assumption that a single wchar_t can encode any representation character. For ABIs where wchar_t is 16 bits, this is not true, and many of the NTMBS functions are incomplete, as they do not allow for stateful wide character encodings.

Proposal

  • Introduce new, more precise, terms, and use them throughout.
  • Clarfiy when literal vs dynamic encoding is intended, where not already clear.
  • Clarify that wchar_t may be a UTF-16 encoded type.
    • This can be separated from the rest of the proposal. It is certainly the most controversial as it involves admiting the standard is broken in places, albeit without requiring any implementation to change.

Terms

Narrow Literal Encoding and Wide Literal Encoding : The encoding used for character and wide character and string literals in a translation unit.

Dynamic Encoding : The encoding implied by the LC_CTYPE category of the current locale.

Character Set [https://unicode.org/glossary/#character_set] : A collection of elements used to represent textual information.

Abstract Character [https://unicode.org/glossary/#abstract_character] : A unit of information used for the organization, control, or representation of textual data.

Character Repertoire [https://unicode.org/glossary/#character_repertoire] : The collection of characters included in a character set.

Source character set : The abstract characters that must be representable in the internal character set used after phase 1 of translation. All characters not in the source character set are converted to universal-character-names, which are made up of characters from the basic character set. The abstract parser only sees characters in the source character set.

Basic execution character set : The abstract characters the character repertoire of the character set used for literals must include. A superset of the abstract characters in the basic source character set.

Execution character set : The set of abstract characters representable by a char or char string literal

Excecution wide-character set : The set of abstract characters representable by a wchar_t or wchar_t string literal

Associated Encoding : There are 5 types with associated encodings, char, wchar_t, char8_t, char16_t, and char32_t. For the latter three, the association is fixed to the UTF encoding of the same size, UTF-8, UTF-16, and UTF-32. For char and wchar_t the association is determined dynamically, based on the current locale. For each it is the scheme used to decode data of that type into code points.

Example of use of new words (not an actual proposal, yet)

Proposal Dnnnn

Loosly modeled after source_position, the callable objects literal_encoding and wide_literal_encoding provide mechanism for the compiler to expose information to the translation unit, in this case how to invert the mapping that the compiler knew when it encoded character and string literals.

literal_encoding

Returns an unspecified callable taking a range of elements of type char and returning a view of of code points decoded from the input range treating them as being in the literal encoding used for the current translation unit. For latin-1 it might be

[](auto&& r){return views::transform(r, [](unsigned char c) -> char32_t {return c;});

For any other single byte encoding a static lookup table of char32_t[256] could be used.

wide_literal_encoding

Returns an unspecified callable taking a range of elements of type char and returning a view of of code points decoded from the input range treating them as being in the wide literal encoding used for the current translation unit.

Discussion of proposal Dnnnn

Still woefully underspecified, it is at least clearer what is being discussed, and how it might be something a compiler could implement. Without the terms literal encoding and wide literal encoding discussion gets bogged down quickly around the difference between what the compiler does and what locale and the dynamic encoding imply for character conversions.

Wording

lex.charset

[lex.charset.1]{.pnum} The basic source character set consists of 96 [abstract]{.add} characters: the space character, the control characters representing horizontal tab, vertical tab, form feed, and new-line, plus the following 91 graphical characters:

a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
_ { } \[ \] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '

Editorial Note: Should really be a list of unicode names or universal names, aka code points e.g.

0041..005A    ; L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061..007A    ; L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
0030..0039    ; Nd  [10] DIGIT ZERO..DIGIT NINE
0020          ; Zs       SPACE
0021          ; Po       EXCLAMATION MARK
0022          ; Po       QUOTATION MARK
0023          ; Po       NUMBER SIGN
0025          ; Po       PERCENT SIGN
0026          ; Po       AMPERSAND
0027          ; Po       APOSTROPHE
0028          ; Ps       LEFT PARENTHESIS
0029          ; Pe       RIGHT PARENTHESIS
002A          ; Po       ASTERISK
002B          ; Sm       PLUS SIGN
002C          ; Po       COMMA
002D          ; Pd       HYPHEN-MINUS
002E          ; Po       FULL STOP
002F          ; Po       SOLIDUS
003A          ; Po       COLON
003B          ; Po       SEMICOLON
003C          ; Sm       LESS-THAN SIGN
003D          ; Sm       EQUALS SIGN
003E          ; Sm       GREATER-THAN SIGN
003F          ; Po       QUESTION MARK
005B          ; Ps       LEFT SQUARE BRACKET
005C          ; Po       REVERSE SOLIDUS
005D          ; Pe       RIGHT SQUARE BRACKET
005E          ; Sk       CIRCUMFLEX ACCENT
005F          ; Pc       LOW LINE
007B          ; Ps       LEFT CURLY BRACKET
007C          ; Sm       VERTICAL LINE
007D          ; Pe       RIGHT CURLY BRACKET
007E          ; Sm       TILDE

[lex.charset.3]{.pnum} The basic execution character set and the basic execution wide-character set shall each contain all the [members]{.rm} [abstract characters]{.add} of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose value is 0. For each [element in the]{.add} basic execution character set, the [encoded]{.add} values of the members shall be non-negative and distinct from one another. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous. The execution character set and the execution wide-character set are implementation-defined supersets of the basic execution character set and the basic execution wide-character set, respectively. The [encoded]{.add} values of the members of the execution character sets [and the sets of additional members]{.rm} are [implementation defined]{.add}[locale-specific]{.rm}.

lex.con

[lex.conn.2] A character literal that does not begin with u8, u, U, or L is an ordinary character literal. An ordinary character literal that contains a single c-char representable in the execution character set has type char, with value equal to the numerical value of [the encoding of]{.rm} the c-char in the [literal encoding]{.add}. An ordinary character literal that contains more than one c-char is a multicharacter literal. A multicharacter literal, or an ordinary character literal containing a single c-char not representable in the execution character set, is conditionally-supported, has type int, and has an implementation-defined value.

[lex.conn.6] A character literal that begins with the letter L, such as L'z', is a wide-character literal. A wide-character literal has type wchar_t. The value of a wide-character literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the [execution wide-character set]{.rm}[wide literal encoding]{.add}, unless the c-char has no representation in the execution wide-character set, in which case the value is implementation-defined. [[ Note: The type wchar_t is able to represent all members of the execution wide-character set (see [basic.fundamental]). — end note ]]{.rm} The value of a wide-character literal containing multiple c-chars is implementation-defined.

lex.string

[lex.string.6] After translation phase 6, a string-literal that does not begin with an encoding-prefix is an ordinary string literal. An ordinary string literal has type “array of n const char” where n is the size of the string as defined below, has static storage duration ([basic.stc]), and is initialized with the [given characters]{.rm}[values of the characters in the narrow literal encoding]{.add}.

[lex.string.8] Ordinary string literals [and UTF-8 string literals]{.rm} are also referred to as narrow string literals.

basic.fundemental

[basic.fundemental.8] Type wchar_t is a distinct type that has an implementation-defined signed or unsigned integer type as its underlying type. [The values of type wchar_t can represent distinct codes for all members of the largest extended character set specified among the supported locales ([locale]).]{.rm}

Addition somewhere in [lex]

The literal and string literal for char, wchar_t, char8_t, char16_t, and char32_t have associated encodings. For char8_t, char16_t, and char32_t, the encodings are fixed both for compile and execution as UTF-8, UTF-16, and UTF-32. There are no associated encodings for unsigned char or signed char. The associated encodings for char and wchar_t types are implementation defined. [Note: it is unspecified how translation units with differing associated encodings for char and wchar_t combine. Differing visible definitions will be ODR violations.]

Conclusions

Moving to a more modern and consistent terminology for character sets, character encodings, and a better definition of what types have associated encodings when, will make it more straight-forward to understand the current technical issues with text and to enable simpler proposals for text in the future.