Update ENSIP-15 Version and Links #288

Merged
merged 1 commit into from
Oct 8, 2024
66 changes: 35 additions & 31 deletions docs/ensip/15.mdx
@@ -8,19 +8,19 @@ export const meta = {
ensip: {
status: 'draft',
created: '2023-04-03',
updated: '2023-09-18',
updated: '2024-09-14',
}
};

# ENSIP-15: ENS Name Normalization Standard

## Abstract

This ENSIP standardizes Ethereum Name Service (ENS) name normalization process outlined in [ENSIP-1 § Name Syntax](./ensip-1-ens.md#name-syntax).
This ENSIP standardizes Ethereum Name Service (ENS) name normalization process outlined in [ENSIP-1 § Name Syntax](./1#name-syntax).

## Motivation

* Since [ENSIP-1](./ensip-1-ens.md) (originally [EIP-137](https://eips.ethereum.org/EIPS/eip-137)) was finalized in 2016, Unicode has [evolved](https://unicode.org/history/publicationdates.html) from version 8.0.0 to 15.0.0 and incorporated many new characters, including complex emoji sequences.
* Since [ENSIP-1](./1) (originally [EIP-137](https://eips.ethereum.org/EIPS/eip-137)) was finalized in 2016, Unicode has [evolved](https://unicode.org/history/publicationdates.html) from version 8.0.0 to 15.0.0 and incorporated many new characters, including complex emoji sequences.
* ENSIP-1 does not state the version of Unicode.
* ENSIP-1 implies but does not state an explicit flavor of IDNA processing.
* [UTS-46](https://unicode.org/reports/tr46/) is insufficient to normalize emoji sequences. Correct emoji processing is only possible with [UTS-51](https://www.unicode.org/reports/tr51/).
@@ -34,10 +34,10 @@ This ENSIP standardizes Ethereum Name Service (ENS) name normalization process o

## Specification

* Unicode version `15.1.0`
* Unicode version `16.0.0`
* Normalization is a living specification and should use the latest stable version of Unicode.
* [`spec.json`](./ensip-15/spec.json) contains all [necessary data](#description-of-specjson) for normalization.
* [`nf.json`](./ensip-15/nf.json) contains all [necessary data](#description-of-nfjson) for [Unicode Normalization Forms](https://unicode.org/reports/tr15/) NFC and NFD.
* [`spec.json`](https://github.com/adraffy/ens-normalize.js/blob/main/derive/output/spec.json) contains all [necessary data](#description-of-specjson) for normalization.
* [`nf.json`](https://github.com/adraffy/ens-normalize.js/blob/main/derive/output/nf.json) contains all [necessary data](#description-of-nfjson) for [Unicode Normalization Forms](https://unicode.org/reports/tr15/) NFC and NFD.

### Definitions

@@ -67,18 +67,18 @@ This ENSIP standardizes Ethereum Name Service (ENS) name normalization process o
* All **Emoji Sequence** have explicit emoji-presentation.
* The convention of ignoring presentation is difficult to change because:
* Presentation characters (`FE0F` and `FE0E`) are **Ignored**
* [ENSIP-1](./ensip-1-ens.md) did not treat emoji differently from text
* [ENSIP-1](./1) did not treat emoji differently from text
* Registration hashes are immutable
* [Beautification](#annex-beautification) can be used to restore emoji-presentation in normalized names.

### Algorithm

* Normalization is the process of canonicalizing a name before for [hashing](./ensip-1-ens.md#namehash-algorithm).
* Normalization is the process of canonicalizing a name before for [hashing](./1#namehash-algorithm).
* It is idempotent: applying normalization multiple times produces the same result.
* For user convenience, leading and trailing whitespace should be trimmed before normalization, as all whitespace codepoints are disallowed. Inner characters should remain unmodified.
* No string transformations (like case-folding) should be applied.

1. [Split](#split) the name into [labels](./ensip-1-ens.md#name-syntax).
1. [Split](#split) the name into [labels](./1#name-syntax).
1. [Normalize](#normalize) each label.
1. [Join](#join) the labels together into a name again.
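The three steps above can be sketched as follows. This is a minimal illustration, not the reference implementation: `placeholder_normalize_label` is a hypothetical stand-in that only maps ASCII `A-Z` to `a-z` (mirroring the `A` → `a` mapping example), and splitting on `2E (.) FULL STOP` is an assumption about the Split step, whose details fall outside this excerpt. `prepare_user_input` is likewise a hypothetical helper for the optional whitespace trimming.

```python
from typing import Callable

STOP = "."  # assumed label separator: 2E (.) FULL STOP


def placeholder_normalize_label(label: str) -> str:
    """Hypothetical stand-in for real per-label normalization.

    It only maps ASCII A-Z to a-z; the real step needs the data in spec.json.
    """
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in label)


def normalize_name(
    name: str,
    normalize_label: Callable[[str], str] = placeholder_normalize_label,
) -> str:
    """Split the name into labels, normalize each label, then join them again."""
    labels = name.split(STOP)                         # 1. Split
    normalized = [normalize_label(l) for l in labels] # 2. Normalize
    return STOP.join(normalized)                      # 3. Join


def prepare_user_input(raw: str) -> str:
    """Trim leading and trailing whitespace only; inner characters stay as-is."""
    return raw.strip()


once = normalize_name(prepare_user_input("  Nick.ETH "))
twice = normalize_name(once)  # idempotence: a second pass changes nothing
```

With this placeholder, `once` and `twice` are both `"nick.eth"`, which illustrates the idempotence property stated above.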

@@ -103,7 +103,7 @@ Examples:

### Tokenize

Convert a label into a list of `Text` and `Emoji` tokens, each with a payload of codepoints. The complete list of character types and [emoji sequences](./ensip-15/emoji.md#valid-emoji-sequences) can be found in [`spec.json`](#description-of-specjson).
Convert a label into a list of `Text` and `Emoji` tokens, each with a payload of codepoints. The complete list of character types and [emoji sequences](#appendix-additional-resources) can be found in [`spec.json`](#description-of-specjson).

1. Allocate an empty codepoint buffer.
1. Find the longest **Emoji Sequence** that matches the remaining input.
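The longest-match step can be sketched in isolation. This is a simplified illustration under stated assumptions: `longest_emoji_match` is a hypothetical helper, the emoji data is modeled as a plain set of codepoint tuples rather than whatever structure an implementation derives from `spec.json`, and the two-entry `EXAMPLE_EMOJI` set is for demonstration only. The later steps of the list are outside this excerpt and are not modeled.

```python
from typing import FrozenSet, Optional, Sequence, Tuple

Codepoints = Tuple[int, ...]

# Illustrative data only; the real list of valid sequences lives in spec.json.
EXAMPLE_EMOJI: FrozenSet[Codepoints] = frozenset({
    (0x1F468,),                  # man
    (0x1F468, 0x200D, 0x1F4BB),  # man technologist
})


def longest_emoji_match(
    cps: Sequence[int], pos: int, emoji: FrozenSet[Codepoints]
) -> Optional[Codepoints]:
    """Return the longest sequence in `emoji` that starts at cps[pos], or None."""
    max_len = max((len(e) for e in emoji), default=0)
    # Try the longest possible candidate first so shorter prefixes cannot win.
    for length in range(min(max_len, len(cps) - pos), 0, -1):
        candidate = tuple(cps[pos:pos + length])
        if candidate in emoji:
            return candidate
    return None


label = [0x1F468, 0x200D, 0x1F4BB, 0x61]  # man technologist followed by "a"
match = longest_emoji_match(label, 0, EXAMPLE_EMOJI)
```

Here `match` is the three-codepoint sequence rather than the single `1F468`, because longer candidates are tried first. A production implementation would more likely walk a trie than rescan slices.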
@@ -258,7 +258,7 @@ A label composed of confusable characters isn't necessarily confusable.

## Description of `spec.json`

* **Groups** (`"groups"`) — [groups](./ensip-15/groups.md) of characters that can constitute a label
* **Groups** (`"groups"`) — [groups](#appendix-additional-resources) of characters that can constitute a label
* `"name"` — ASCII name of the group (or abbreviation if **Restricted**)
* Examples: *Latin*, *Japanese*, *Egyp*
* **Restricted** (`"restricted"`) — **`true`** if [Excluded](https://www.unicode.org/reports/tr31#Table_Candidate_Characters_for_Exclusion_from_Identifiers) or [Limited-Use](https://www.unicode.org/reports/tr31/#Table_Limited_Use_Scripts) script
@@ -272,7 +272,7 @@ A label composed of confusable characters isn't necessarily confusable.
* Example: `à̀̀` → `E0 300 300`
* Currently, every group that is **CM Whitelist** has zero compound sequences.
* **CM Whitelisted** is effectively **`true`** if `[]` otherwise **`false`**
* **Ignored** (`"ignored"`) — [characters](./ensip-15/ignored.csv) that are ignored during normalization
* **Ignored** (`"ignored"`) — [characters](#appendix-additional-resources) that are ignored during normalization
* Example: `34F (�) COMBINING GRAPHEME JOINER`
* **Mapped** (`"mapped"`) — characters that are mapped to a sequence of **valid** characters
* Example: `41 (A) LATIN CAPITAL LETTER A` → `[61 (a) LATIN SMALL LETTER A]`
@@ -282,15 +282,15 @@ A label composed of confusable characters isn't necessarily confusable.
* Example: `34 (4) DIGIT FOUR`
* **Confused** (`"confused"`) — subset of confusable characters that confuse
* Example: `13CE (Ꮞ) CHEROKEE LETTER SE`
* **Fenced** (`"fenced"`) — [characters](./ensip-15/fenced.csv) that cannot be first, last, or contiguous
* **Fenced** (`"fenced"`) — [characters](#appendix-additional-resources) that cannot be first, last, or contiguous
* Example: `2044 (⁄) FRACTION SLASH`
* **Emoji Sequence(s)** (`"emoji"`) — valid [emoji sequences](./ensip-15/emoji.md#valid-emoji-sequences)
* **Emoji Sequence(s)** (`"emoji"`) — valid [emoji sequences](#appendix-additional-resources)
* Example: `👨‍💻 [1F468 200D 1F4BB] man technologist`
* **Combining Marks / CM** (`"cm"`) — [characters](./ensip-15/cm.csv) that are [Combining Marks](https://unicode.org/faq/char_combmark.html)
* **Non-spacing Marks / NSM** (`"nsm"`) — valid [subset](./ensip-15/nsm.csv) of **CM** with general category (`"Mn"` or `"Me"`)
* **Combining Marks / CM** (`"cm"`) — [characters](#appendix-additional-resources) that are [Combining Marks](https://unicode.org/faq/char_combmark.html)
* **Non-spacing Marks / NSM** (`"nsm"`) — valid [subset](#appendix-additional-resources) of **CM** with general category (`"Mn"` or `"Me"`)
* **Maximum NSM** (`"nsm_max"`) — maximum sequence length of unique **NSM**
* **Should Escape** (`"escape"`) — [characters](./ensip-15/escape.csv) that shouldn't be printed
* **NFC Check** (`"nfc_check"`) — valid [subset](./ensip-15/nfc_check.csv) of characters that [may require NFC](https://unicode.org/reports/tr15/#NFC_QC_Optimization)
* **Should Escape** (`"escape"`) — [characters](#appendix-additional-resources) that shouldn't be printed
* **NFC Check** (`"nfc_check"`) — valid [subset](#appendix-additional-resources) of characters that [may require NFC](https://unicode.org/reports/tr15/#NFC_QC_Optimization)
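As an illustration of how one of these fields might be consumed, the **Fenced** rule ("cannot be first, last, or contiguous") can be sketched as a small check. This is a hedged sketch, not the reference implementation: `check_fenced` and `FENCED_EXAMPLE` are hypothetical names, and the one-element set uses only the `2044` example given above instead of the full `"fenced"` list.

```python
from typing import AbstractSet, Sequence

# Illustrative data only: the single example from the list above.
FENCED_EXAMPLE = frozenset({0x2044})  # FRACTION SLASH


def check_fenced(cps: Sequence[int], fenced: AbstractSet[int]) -> None:
    """Raise ValueError if a fenced character is first, last, or adjacent to another."""
    if not cps:
        return
    if cps[0] in fenced:
        raise ValueError("leading fenced character")
    if cps[-1] in fenced:
        raise ValueError("trailing fenced character")
    # Compare each codepoint with its successor to catch contiguous fenced characters.
    for a, b in zip(cps, cps[1:]):
        if a in fenced and b in fenced:
            raise ValueError("adjacent fenced characters")


check_fenced([0x61, 0x2044, 0x62], FENCED_EXAMPLE)  # "a⁄b" passes silently
```

A label such as `[0x2044, 0x61]` (fenced character first) or `[0x61, 0x2044, 0x2044, 0x62]` (two in a row) raises `ValueError` under this sketch.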

## Description of `nf.json`

@@ -343,7 +343,7 @@ A label composed of confusable characters isn't necessarily confusable.
* `3002 (。) IDEOGRAPHIC FULL STOP`
* `FF0E (.) FULLWIDTH FULL STOP`
* `FF61 (。) HALFWIDTH IDEOGRAPHIC FULL STOP`
* [Many characters](./ensip-15/disallowed.csv) are **disallowed** for various reasons:
* [Many characters](#appendix-additional-resources) are **disallowed** for various reasons:
* Nearly all punctuation are **disallowed**.
* Example: `589 (։) ARMENIAN FULL STOP`
* All parentheses and brackets are **disallowed**.
@@ -379,7 +379,7 @@ A label composed of confusable characters isn't necessarily confusable.
* `2E3A (⸺) TWO-EM DASH` → `"--"`
* `2E3B (⸻) THREE-EM DASH` → `"---"`
* Characters are assigned to **Groups** according to [Unicode Script_Extensions](https://www.unicode.org/reports/tr24/#Script_Extensions_Def).
* **Groups** may contain [multiple scripts](./ensip-15/groups.md):
* **Groups** may contain [multiple scripts](#appendix-additional-resources):
* Only *Latin*, *Greek*, *Cyrillic*, *Han*, *Japanese*, and *Korean* have access to *Common* characters.
* *Latin*, *Greek*, *Cyrillic*, *Han*, *Japanese*, *Korean*, and *Bopomofo* only permit specific **Combining Mark** sequences.
* *Han*, *Japanese*, and *Korean* have access to `a-z`.
@@ -390,9 +390,9 @@ A label composed of confusable characters isn't necessarily confusable.
* Ethereum symbol (`39E (Ξ) GREEK CAPITAL LETTER XI`) is case-folded and *Common*.
* Emoji:
* All emoji are [fully-qualified](https://www.unicode.org/reports/tr51/#def_fully_qualified_emoji).
* Digits (`0-9`) are [not emoji](./ensip-15/emoji.md#demoted-unchanged).
* Emoji [mapped to non-emoji by IDNA](./ensip-15/emoji.md#demoted-mapped) cannot be used as emoji.
* Emoji [disallowed by IDNA](./ensip-15/emoji.md#disabled-emoji-characters) with default text-presentation are **disabled**:
* Digits (`0-9`) are [not emoji](#appendix-additional-resources).
* Emoji [mapped to non-emoji by IDNA](#appendix-additional-resources) cannot be used as emoji.
* Emoji [disallowed by IDNA](#appendix-additional-resources) with default text-presentation are **disabled**:
* `203C (‼️) double exclamation mark`
* `2049 (⁉️) exclamation question mark `
* Remaining emoji characters are marked as **disallowed** (for text processing).
@@ -418,7 +418,7 @@ A label composed of confusable characters isn't necessarily confusable.

* 99% of names are still valid.
* Preserves as much [Unicode IDNA](https://unicode.org/reports/tr46/) and [WHATWG URL](https://url.spec.whatwg.org/#idna) compatibility as possible.
* Only [valid emoji sequences](./ensip-15/emoji.md#valid-emoji-sequences) are permitted.
* Only [valid emoji sequences](#appendix-additional-resources) are permitted.

## Security Considerations

@@ -454,7 +454,7 @@ Copyright and related rights waived via [CC0](https://creativecommons.org/public
## Appendix: Reference Specifications

* [EIP-137: Ethereum Domain Name Service](https://eips.ethereum.org/EIPS/eip-137)
* [ENSIP-1: ENS](./ensip-1-ens.md)
* [ENSIP-1: ENS](./1)
* [UAX-15: Normalization Forms](https://unicode.org/reports/tr15/)
* [UAX-24: Script Property](https://www.unicode.org/reports/tr24/)
* [UAX-29: Text Segmentation](https://unicode.org/reports/tr29/)
@@ -471,15 +471,19 @@ Copyright and related rights waived via [CC0](https://creativecommons.org/public

## Appendix: Additional Resources

* [Supported Groups](./ensip-15/groups.md)
* [Supported Emoji](./ensip-15/emoji.md)
* [Additional Disallowed Characters](./ensip-15/disallowed.csv)
* [**Ignored** Characters](./ensip-15/ignored.csv)
* [**Should Escape** Characters ](./ensip-15/ignored.csv)
* [Supported Groups](https://github.com/adraffy/ens-normalize.js/blob/main/tools/ensip/groups.md)
* [Supported Emoji](https://github.com/adraffy/ens-normalize.js/blob/main/tools/ensip/emoji.md)
* [Additional Disallowed Characters](https://github.com/adraffy/ens-normalize.js/blob/main/tools/ensip/disallowed.csv)
* [Ignored Characters](https://github.com/adraffy/ens-normalize.js/blob/main/tools/ensip/ignored.csv)
* [Should Escape Characters ](https://github.com/adraffy/ens-normalize.js/blob/main/tools/ensip/escape.csv)
* [Combining Marks](https://github.com/adraffy/ens-normalize.js/blob/main/tools/ensip/cm.csv)
* [Non-spacing Marks](https://github.com/adraffy/ens-normalize.js/blob/main/tools/ensip/nsm.csv)
* [Fenced Characters](https://github.com/adraffy/ens-normalize.js/blob/main/tools/ensip/fenced.csv)
* [NFC Quick Check](https://github.com/adraffy/ens-normalize.js/blob/main/tools/ensip/nfc_check.csv)

## Appendix: Validation Tests

A list of [validation tests](./ensip-15/tests.json) are provided with the following interpretation:
A list of [validation tests](https://github.com/adraffy/ens-normalize.js/blob/main/validate/tests.json) are provided with the following interpretation:

* Already Normalized: `{name: "a"}` → `normalize("a")` is `"a"`
* Need Normalization: `{name: "A", norm: "a"}` → `normalize("A")` is `"a"`
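The two interpretations listed above could be applied by a small runner such as the one below. This is a sketch under assumptions: `run_validation_tests` is a hypothetical helper, it handles only the `name` and `norm` fields shown in this excerpt (any other fields or cases in `tests.json` are not modeled), and `str.lower` is a stand-in for a real `normalize` function.

```python
from typing import Callable, Dict, Iterable, List


def run_validation_tests(
    tests: Iterable[Dict[str, str]],
    normalize: Callable[[str], str],
) -> List[str]:
    """Return a description of each failing case (empty list means all passed)."""
    failures: List[str] = []
    for case in tests:
        name = case["name"]
        # Without "norm", the name is expected to be already normalized.
        expected = case.get("norm", name)
        actual = normalize(name)
        if actual != expected:
            failures.append(f"{name!r}: expected {expected!r}, got {actual!r}")
    return failures


example_tests = [
    {"name": "a"},               # Already Normalized
    {"name": "A", "norm": "a"},  # Need Normalization
]

# str.lower is a hypothetical stand-in, not a real ENSIP-15 normalizer.
failures = run_validation_tests(example_tests, str.lower)
```

Returning a list of failures instead of raising on the first mismatch makes it easier to see how far an implementation diverges from the expected results.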