<pre class=metadata>
Group: WHATWG
H1: Encoding
Shortname: encoding
Text Macro: TWITTER encodings
Abstract: The Encoding Standard defines encodings and their JavaScript API.
Translation: ja https://triple-underscore.github.io/Encoding-ja.html
Markup Shorthands: css off
Translate IDs: dictdef-textdecoderoptions textdecoderoptions,dictdef-textdecodeoptions textdecodeoptions,index section-index
</pre>
<link rel=stylesheet href=visualization-colors.css>
<pre class=link-defaults>
spec:streams; type:interface; text:ReadableStream
</pre>
<h2 id=preface>Preface</h2>
<p>The UTF-8 encoding is the most appropriate encoding for interchange of Unicode, the
universal coded character set. Therefore for new protocols and formats, as well as
existing formats deployed in new contexts, this specification requires (and defines) the
UTF-8 encoding.
<p>The other (legacy) encodings have been defined to some extent in the past. However,
user agents have not always implemented them in the same way, have not always used the
same labels, and often differ in dealing with undefined and former proprietary areas of
encodings. This specification addresses those gaps so that new user agents do not have to
reverse engineer encoding implementations and existing user agents can converge.
<p>In particular, this specification defines all those encodings, their algorithms to go
from bytes to scalar values and back, and their canonical names and identifying labels.
This specification also defines an API to expose part of the encoding algorithms to
JavaScript.
<p>User agents have also significantly deviated from the labels listed in the
<a href=https://www.iana.org/assignments/character-sets/character-sets.xhtml>IANA Character Sets registry</a>.
To stop spreading legacy encodings further, this specification is exhaustive about the
aforementioned details and therefore has no need for the registry. In particular, this
specification does not provide a mechanism for extending any aspect of encodings.
<h2 id=security-background>Security background</h2>
<p>There is a set of encoding security issues when the producer and consumer do not agree
on the encoding in use, or on the way a given encoding is to be implemented. For instance,
an attack was reported in 2011 where a <a>Shift_JIS</a> lead byte 0x82 was used to
“mask” a 0x22 trail byte in a JSON resource of which an attacker could control some field.
The producer did not see the problem even though this is an illegal byte combination. The
consumer decoded it as a single U+FFFD and therefore changed the overall interpretation as
U+0022 is an important delimiter. Decoders of encodings that use multiple bytes for scalar
values now require that in case of an illegal byte combination, a scalar value in the
range U+0000 to U+007F, inclusive, cannot be “masked”. For the aforementioned sequence the
output would be U+FFFD U+0022.
<p>This is a larger issue for encodings that map anything that is an <a>ASCII byte</a> to
something that is not an <a>ASCII code point</a>, when there is no lead byte present. These
are “ASCII-incompatible” encodings and other than <a>ISO-2022-JP</a>, <a>UTF-16BE</a>,
and <a>UTF-16LE</a>, which are unfortunately required due to deployed content, they are not
supported. (Investigation is
<a href=https://github.com/whatwg/encoding/issues/8 lt="Add more labels to the replacement encoding">ongoing</a>
whether more labels of other such encodings can be mapped to the <a>replacement</a>
encoding, rather than the unknown encoding fallback.) An example attack is injecting
carefully crafted content into a resource and then encouraging the user to override the
encoding, resulting in e.g. script execution.
<p>Encoders used by URLs found in HTML and HTML's form feature can also result in slight
information loss when an encoding is used that cannot represent all scalar values. E.g.
when a resource uses the <a>windows-1252</a> encoding a server will not be able to
distinguish between an end user entering “💩” and “&amp;#128169;” into a form.
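<p class=note>The information loss described above can be reproduced with a short sketch. It
assumes Python's built-in <code>cp1252</code> codec and its <code>xmlcharrefreplace</code> error
handler as stand-ins for the <a>windows-1252</a> encoder and the "<code>html</code>" error mode;
neither is the algorithm defined by this specification, and the variable names are illustrative.

```python
# Python's "xmlcharrefreplace" handler stands in for the "html" error mode here.
# It is an approximation chosen for illustration, not this specification's encoder.
typed_emoji = "\U0001F4A9".encode("cp1252", errors="xmlcharrefreplace")
typed_reference = "&#128169;".encode("cp1252", errors="xmlcharrefreplace")

# Both inputs produce the same bytes, so a server cannot tell them apart.
print(typed_emoji)                     # b'&#128169;'
print(typed_emoji == typed_reference)  # True
```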
<p>The problems outlined here go away when exclusively using UTF-8, which is one of the
many reasons that is now the mandatory encoding for all things.
<p class=note>See also the <a href=#browser-ui>Browser UI</a> chapter.
<h2 id=terminology>Terminology</h2>
<p>This specification depends on the Infra Standard. [[!INFRA]]
<p>Hexadecimal numbers are prefixed with "0x".
<p>In equations, all numbers are integers, addition is represented by "+", subtraction by "−",
multiplication by "×", integer division by "/" (returns the quotient), modulo by "%" (returns the
remainder of an integer division), logical left shifts by "&lt;&lt;", logical right shifts by "&gt;&gt;",
bitwise AND by "&amp;", and bitwise OR by "|".
<p>For logical right shifts, operands must have at least twenty-one bits of precision.
<hr>
<p>A <dfn id=concept-token>token</dfn> is a piece of data, such as a <a>byte</a> or
<a>scalar value</a>.
<p>A <dfn id=concept-stream>stream</dfn> represents an ordered sequence of
<a>tokens</a>. <dfn>End-of-stream</dfn> is a special
<a>token</a> that signifies no more
<a>tokens</a> are in the
<a for=/>stream</a>.
<p>When a <a>token</a> is
<dfn id=concept-stream-read for=stream>read</dfn> from a <a for=/>stream</a>,
the first token in the stream must be returned and subsequently removed, and
<a>end-of-stream</a> must be returned otherwise.
<!-- this means read is blocking on e.g. networking activity;
SimonSapin thinks this is fine, blame him if not -->
<p>When one or more <a>tokens</a> are
<dfn id=concept-stream-prepend for=stream>prepended</dfn> to a
<a for=/>stream</a>, those tokens must be inserted, in given order,
before the first token in the stream.
<p class=example id=example-tokens>Prepending the sequence of tokens <code>&amp;#128169;</code>
to a stream "<code> hello world</code>" results in a stream
"<code>&amp;#128169; hello world</code>". The next token to be read would be
<code>&amp;</code>. <!-- 💩 -->
<p>When one or more <a>tokens</a> are
<dfn id=concept-stream-push for=stream>pushed</dfn> to a <a for=/>stream</a>,
those tokens must be inserted, in given order, after the last token in the stream.
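<p class=note>The three stream operations can be sketched as a small in-memory class. This is a
minimal illustration only: the <code>Stream</code> class, its method names, and the
<code>END_OF_STREAM</code> sentinel are assumptions made for this example, and a real stream could
block on network activity rather than hold all tokens in memory.

```python
from collections import deque

END_OF_STREAM = object()  # sentinel standing in for the end-of-stream token


class Stream:
    """Hypothetical in-memory stream of tokens (bytes or scalar values)."""

    def __init__(self, tokens=()):
        self._tokens = deque(tokens)

    def read(self):
        # Return and remove the first token, or end-of-stream if there is none.
        return self._tokens.popleft() if self._tokens else END_OF_STREAM

    def prepend(self, *tokens):
        # Insert the tokens, in the given order, before the first token.
        self._tokens.extendleft(reversed(tokens))

    def push(self, *tokens):
        # Insert the tokens, in the given order, after the last token.
        self._tokens.extend(tokens)


stream = Stream(" hello world")
stream.prepend(*"&#128169;")
first = stream.read()                             # "&"
rest = "".join(iter(stream.read, END_OF_STREAM))  # "#128169; hello world"
```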
<h2 id=encodings>Encodings</h2>
<p>An <dfn export>encoding</dfn> defines a mapping from a <a>scalar value</a> sequence to
a <a>byte</a> sequence (and vice versa). Each <a for=/>encoding</a> has a
<dfn id=name export for=encoding>name</dfn>, and one or more
<dfn id=label export for=encoding lt=label>labels</dfn>.
<p class="note no-backref">This specification defines three <a for=/>encodings</a> with the same
names as <i>encoding schemes</i> defined in the Unicode standard: <a>UTF-8</a>, <a>UTF-16LE</a>, and
<a>UTF-16BE</a>. The <a for=/>encodings</a> differ from the <i>encoding schemes</i> by byte order
mark (also known as BOM) handling not being part of the <a for=/>encodings</a> themselves and
instead being part of wrapper algorithms in this specification, whereas byte order mark handling is
part of the definition of the <i>encoding schemes</i> in the Unicode Standard. <a>UTF-8</a> used
together with the <a>UTF-8 decode</a> algorithm matches the <i>encoding scheme</i> of the same name.
This specification does not provide wrapper algorithms that would combine with <a>UTF-16LE</a> and
<a>UTF-16BE</a> to match the similarly-named <i>encoding schemes</i>. [[UNICODE]]
<h3 id=encoders-and-decoders>Encoders and decoders</h3>
<p>Each <a for=/>encoding</a> has an associated <dfn>decoder</dfn> and most of them have an
associated <dfn>encoder</dfn>. Each <a for=/>decoder</a> and <a for=/>encoder</a> have a
<dfn>handler</dfn> algorithm. A <a>handler</a> algorithm takes an input
<a for=/>stream</a> and a <a>token</a>, and returns
<dfn>finished</dfn>, one or more <a>tokens</a>, <dfn>error</dfn>
optionally with a <a>code point</a>, or <dfn>continue</dfn>.
<p class="note no-backref">The <a>replacement</a>, <a>UTF-16BE</a>, and
<a>UTF-16LE</a> <a for=/>encodings</a> have no <a for=/>encoder</a>.
<p>An <dfn>error mode</dfn> as used below is "<code>replacement</code>" (default) or
"<code>fatal</code>" for a <a for=/>decoder</a> and "<code>fatal</code>" (default) or
"<code>html</code>" for an <a for=/>encoder</a>.
<p class=note>An XML processor would set <a for=/>error mode</a> to "<code>fatal</code>".
[[XML]]
<p class=note>The "<code>html</code>" <a for=/>error mode</a> exists because URLs and HTML forms
require a non-terminating legacy <a for=/>encoder</a>. The "<code>html</code>"
<a for=/>error mode</a> causes a sequence to be emitted that cannot be distinguished from
legitimate input and can therefore lead to silent data loss. Developers are strongly
encouraged to use the <a>UTF-8</a> <a for=/>encoding</a> to prevent this from
happening.
[[URL]]
[[HTML]]
<p>To <dfn id=concept-encoding-run for=encoding>run</dfn> an <a for=/>encoding</a>'s
<a for=/>decoder</a> or <a for=/>encoder</a> <var>encoderDecoder</var> with input
<a for=/>stream</a> <var>input</var>, output
<a for=/>stream</a> <var>output</var>, and optional
<a for=/>error mode</a> <var>mode</var>, run these steps:
<ol>
<li><p>If <var>mode</var> is not given, set it to "<code>replacement</code>", if
<var>encoderDecoder</var> is a <a for=/>decoder</a>, and "<code>fatal</code>" otherwise.
<li><p>Let <var>encoderDecoderInstance</var> be a new <var>encoderDecoder</var>.
<li>
<p>While true:
<ol>
<li><p>Let <var>result</var> be the result of
<a>processing</a> the result of
<a>reading</a> from <var>input</var> for
<var>encoderDecoderInstance</var>, <var>input</var>, <var>output</var>, and
<var>mode</var>.
<li><p>If <var>result</var> is not <a>continue</a>, return <var>result</var>.
<li><p>Otherwise, do nothing.
</ol>
</ol>
<p>To <dfn id=concept-encoding-process for=encoding>process</dfn> a
<a>token</a> <var>token</var> for an <a for=/>encoding</a>'s
<a for=/>encoder</a> or <a for=/>decoder</a> instance <var>encoderDecoderInstance</var>,
<a for=/>stream</a> <var>input</var>, output
<a for=/>stream</a> <var>output</var>, and optional
<a for=/>error mode</a> <var>mode</var>, run these steps:
<ol>
<li><p>If <var>mode</var> is not given, set it to "<code>replacement</code>", if
<var>encoderDecoderInstance</var> is a <a for=/>decoder</a> instance, and "<code>fatal</code>"
otherwise.
<li><p>Assert: if <var>encoderDecoderInstance</var> is an <a for=/>encoder</a> instance,
<var>token</var> is not a <a>surrogate</a>.
<li><p>Let <var>result</var> be the result of running <var>encoderDecoderInstance</var>'s
<a>handler</a> on <var>input</var> and <var>token</var>.
<li><p>If <var>result</var> is <a>continue</a> or <a>finished</a>, return
<var>result</var>.
<li>
<p>Otherwise, if <var>result</var> is one or more <a>tokens</a>:
<ol>
<li><p>Assert: if <var>encoderDecoderInstance</var> is a <a for=/>decoder</a> instance,
<var>result</var> does not contain any <a>surrogates</a>.
<li><p><a>Push</a> <var>result</var> to <var>output</var>.
</ol>
<li>
<p>Otherwise, if <var>result</var> is <a>error</a>, switch on <var>mode</var> and
run the associated steps:
<dl class=switch>
<dt>"<code>replacement</code>"
<dd><a>Push</a> U+FFFD to <var>output</var>.
<dt>"<code>html</code>"
<dd><a>Prepend</a> U+0026, U+0023, followed by the
shortest sequence of <a>ASCII digits</a> representing <var>result</var>'s
<a>code point</a> in base ten, followed by U+003B to <var>input</var>.
<!-- &# ... ; -->
<dt>"<code>fatal</code>"
<dd>Return <a>error</a>.
</dl>
<li>Return <a>continue</a>.
</ol>
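<p class=note>The <a for=encoding>run</a> and <a for=encoding>process</a> steps can be sketched as
follows. The two toy ASCII-only classes, the <code>is_decoder</code> attribute, the sentinel
objects, and the use of a <code>deque</code> as a stream are all assumptions made for this
illustration; they are not encodings or interfaces defined by this specification, and the surrogate
assertions are omitted.

```python
from collections import deque

END_OF_STREAM = object()  # stands in for the end-of-stream token
CONTINUE = object()
FINISHED = object()


class Error:
    """The error result, optionally carrying a code point."""

    def __init__(self, code_point=None):
        self.code_point = code_point


class ToyAsciiDecoder:
    """Hypothetical decoder: bytes 0x00 to 0x7F map to the same code point."""

    is_decoder = True

    def handler(self, input_stream, token):
        if token is END_OF_STREAM:
            return FINISHED
        return [token] if token <= 0x7F else Error()


class ToyAsciiEncoder:
    """Hypothetical encoder: U+0000 to U+007F map to the same byte."""

    is_decoder = False

    def handler(self, input_stream, token):
        if token is END_OF_STREAM:
            return FINISHED
        return [token] if token <= 0x7F else Error(token)


def process(token, instance, input_stream, output_stream, mode):
    result = instance.handler(input_stream, token)
    if result is CONTINUE or result is FINISHED:
        return result
    if isinstance(result, Error):
        if mode == "fatal":
            return result
        if mode == "replacement":
            output_stream.append(0xFFFD)
        elif mode == "html":
            # Prepend "&#", the decimal digits, and ";" so they are read next.
            reference = "&#%d;" % result.code_point
            input_stream.extendleft(reversed([ord(c) for c in reference]))
    else:
        output_stream.extend(result)  # push one or more tokens
    return CONTINUE


def run(encoder_decoder, input_stream, output_stream, mode=None):
    if mode is None:
        mode = "replacement" if encoder_decoder.is_decoder else "fatal"
    instance = encoder_decoder()
    while True:
        token = input_stream.popleft() if input_stream else END_OF_STREAM
        result = process(token, instance, input_stream, output_stream, mode)
        if result is not CONTINUE:
            return result


decoded = deque()
decode_result = run(ToyAsciiDecoder, deque(b"a\xffb"), decoded)

encoded = deque()
encode_result = run(ToyAsciiEncoder, deque([0x61, 0x1F4A9]), encoded, "html")

fatal_result = run(ToyAsciiDecoder, deque(b"\xff"), deque(), "fatal")
```

<p class=note>In this sketch the decoder run leaves U+0061 U+FFFD U+0062 in
<code>decoded</code>, the encoder run in "<code>html</code>" mode emits the bytes of a decimal
numeric character reference after 0x61, and the "<code>fatal</code>" run returns the error result
immediately.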
<h3 id=names-and-labels>Names and labels</h3>
<p>The table below lists all <a for=/>encodings</a>
and their <a>labels</a> user agents must support.
User agents must not support any other <a for=/>encodings</a>
or <a>labels</a>.
<p class=note>For each encoding, <a lt="ASCII lowercase">ASCII-lowercasing</a> its
<a for=encoding>name</a> yields one of its <a for=encoding>labels</a>.
<p>Authors must use the <a>UTF-8</a> <a for=/>encoding</a> and must use the
<a>ASCII case-insensitive</a> "<code>utf-8</code>" <a>label</a> to
identify it.
<p>New protocols and formats, as well as existing formats deployed in new contexts, must
use the <a>UTF-8</a> <a for=/>encoding</a> exclusively. If these protocols and
formats need to expose the <a for=/>encoding</a>'s <a>name</a> or
<a>label</a>, they must expose it as "<code>utf-8</code>".
<p>To
<dfn export lt="get an encoding|getting an encoding" id=concept-encoding-get>get an encoding</dfn>
from a string <var>label</var>, run these steps:
<ol>
<li><p>Remove any leading and trailing <a>ASCII whitespace</a> from
<var>label</var>.
<li><p>If <var>label</var> is an <a>ASCII case-insensitive</a>
match for any of the <a>labels</a> listed in the table
below, return the corresponding <a for=/>encoding</a>, and failure otherwise.
</ol>
<p class="note no-backref">This is a more basic and restrictive algorithm for mapping <a>labels</a>
to <a for=/>encodings</a> than
<a href=https://www.unicode.org/reports/tr22/tr22-8.html#Charset_Alias_Matching>section 1.4 of Unicode Technical Standard #22</a>
prescribes, as that is necessary to be compatible with deployed content.
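<p class=note>A minimal sketch of <a>getting an encoding</a>, assuming a small excerpt of the label
table held in a dictionary. The function names and the use of <code>None</code> for failure are
choices made for this example. Note that only <a>ASCII whitespace</a> is stripped and only ASCII
letters are lowercased, which is why the sketch avoids Python's Unicode-aware
<code>str.strip()</code> and <code>str.lower()</code> defaults.

```python
ASCII_WHITESPACE = "\t\n\x0c\r "  # U+0009, U+000A, U+000C, U+000D, U+0020

# Excerpt only; the full table in this specification is the normative source.
LABELS = {
    "unicode-1-1-utf-8": "UTF-8",
    "utf-8": "UTF-8",
    "utf8": "UTF-8",
    "ascii": "windows-1252",
    "latin1": "windows-1252",
    "utf-16": "UTF-16LE",
    "iso-2022-kr": "replacement",
}


def ascii_lowercase(string):
    # Lowercase A to Z only; every other code point is left untouched.
    return "".join(chr(ord(c) + 0x20) if "A" <= c <= "Z" else c for c in string)


def get_an_encoding(label):
    label = label.strip(ASCII_WHITESPACE)
    # All labels in the table are lowercase, so an ASCII case-insensitive
    # match reduces to a lookup of the ASCII-lowercased label.
    return LABELS.get(ascii_lowercase(label))  # None stands in for failure


print(get_an_encoding(" UTF-8\n"))     # UTF-8
print(get_an_encoding("Latin1"))       # windows-1252
print(get_an_encoding("utf-8\u00a0"))  # None (U+00A0 is not ASCII whitespace)
```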
<table>
<thead>
<tr>
<th><a>Name</a>
<th><a>Labels</a>
<tbody>
<tr><th colspan=2><a href=#the-encoding>The Encoding</a>
<tr>
<td rowspan=3><a>UTF-8</a>
<td>"<code>unicode-1-1-utf-8</code>"
<tr><td>"<code>utf-8</code>"
<tr><td>"<code>utf8</code>"
<tbody>
<tr><th colspan=2><a href=#legacy-single-byte-encodings>Legacy single-byte encodings</a>
<tr>
<td rowspan=4><a>IBM866</a>
<td>"<code>866</code>"
<tr><td>"<code>cp866</code>"
<tr><td>"<code>csibm866</code>"
<tr><td>"<code>ibm866</code>"
<tr>
<td rowspan=9><a>ISO-8859-2</a>
<td>"<code>csisolatin2</code>"
<tr><td>"<code>iso-8859-2</code>"
<tr><td>"<code>iso-ir-101</code>"
<tr><td>"<code>iso8859-2</code>"
<tr><td>"<code>iso88592</code>"
<tr><td>"<code>iso_8859-2</code>"
<tr><td>"<code>iso_8859-2:1987</code>"
<tr><td>"<code>l2</code>"
<tr><td>"<code>latin2</code>"
<tr>
<td rowspan=9><a>ISO-8859-3</a>
<td>"<code>csisolatin3</code>"
<tr><td>"<code>iso-8859-3</code>"
<tr><td>"<code>iso-ir-109</code>"
<tr><td>"<code>iso8859-3</code>"
<tr><td>"<code>iso88593</code>"
<tr><td>"<code>iso_8859-3</code>"
<tr><td>"<code>iso_8859-3:1988</code>"
<tr><td>"<code>l3</code>"
<tr><td>"<code>latin3</code>"
<tr>
<td rowspan=9><a>ISO-8859-4</a>
<td>"<code>csisolatin4</code>"
<tr><td>"<code>iso-8859-4</code>"
<tr><td>"<code>iso-ir-110</code>"
<tr><td>"<code>iso8859-4</code>"
<tr><td>"<code>iso88594</code>"
<tr><td>"<code>iso_8859-4</code>"
<tr><td>"<code>iso_8859-4:1988</code>"
<tr><td>"<code>l4</code>"
<tr><td>"<code>latin4</code>"
<tr>
<td rowspan=8><a>ISO-8859-5</a>
<td>"<code>csisolatincyrillic</code>"
<tr><td>"<code>cyrillic</code>"
<tr><td>"<code>iso-8859-5</code>"
<tr><td>"<code>iso-ir-144</code>"
<tr><td>"<code>iso8859-5</code>"
<tr><td>"<code>iso88595</code>"
<tr><td>"<code>iso_8859-5</code>"
<tr><td>"<code>iso_8859-5:1988</code>"
<tr>
<td rowspan=14><a>ISO-8859-6</a>
<td>"<code>arabic</code>"
<tr><td>"<code>asmo-708</code>"
<tr><td>"<code>csiso88596e</code>"
<tr><td>"<code>csiso88596i</code>"
<tr><td>"<code>csisolatinarabic</code>"
<tr><td>"<code>ecma-114</code>"
<tr><td>"<code>iso-8859-6</code>"
<tr><td>"<code>iso-8859-6-e</code>"
<tr><td>"<code>iso-8859-6-i</code>"
<tr><td>"<code>iso-ir-127</code>"
<tr><td>"<code>iso8859-6</code>"
<tr><td>"<code>iso88596</code>"
<tr><td>"<code>iso_8859-6</code>"
<tr><td>"<code>iso_8859-6:1987</code>"
<tr>
<td rowspan=12><a>ISO-8859-7</a>
<td>"<code>csisolatingreek</code>"
<tr><td>"<code>ecma-118</code>"
<tr><td>"<code>elot_928</code>"
<tr><td>"<code>greek</code>"
<tr><td>"<code>greek8</code>"
<tr><td>"<code>iso-8859-7</code>"
<tr><td>"<code>iso-ir-126</code>"
<tr><td>"<code>iso8859-7</code>"
<tr><td>"<code>iso88597</code>"
<tr><td>"<code>iso_8859-7</code>"
<tr><td>"<code>iso_8859-7:1987</code>"
<tr><td>"<code>sun_eu_greek</code>"
<tr>
<td rowspan=11><a>ISO-8859-8</a>
<td>"<code>csiso88598e</code>"
<tr><td>"<code>csisolatinhebrew</code>"
<tr><td>"<code>hebrew</code>"
<tr><td>"<code>iso-8859-8</code>"
<tr><td>"<code>iso-8859-8-e</code>"
<tr><td>"<code>iso-ir-138</code>"
<tr><td>"<code>iso8859-8</code>"
<tr><td>"<code>iso88598</code>"
<tr><td>"<code>iso_8859-8</code>"
<tr><td>"<code>iso_8859-8:1988</code>"
<tr><td>"<code>visual</code>"
<tr>
<td rowspan=3><a>ISO-8859-8-I</a>
<td>"<code>csiso88598i</code>"
<tr><td>"<code>iso-8859-8-i</code>"
<tr><td>"<code>logical</code>"
<tr>
<td rowspan=7><a>ISO-8859-10</a>
<td>"<code>csisolatin6</code>"
<tr><td>"<code>iso-8859-10</code>"
<tr><td>"<code>iso-ir-157</code>"
<tr><td>"<code>iso8859-10</code>"
<tr><td>"<code>iso885910</code>"
<tr><td>"<code>l6</code>"
<tr><td>"<code>latin6</code>"
<tr>
<td rowspan=3><a>ISO-8859-13</a>
<td>"<code>iso-8859-13</code>"
<tr><td>"<code>iso8859-13</code>"
<tr><td>"<code>iso885913</code>"
<tr>
<td rowspan=3><a>ISO-8859-14</a>
<td>"<code>iso-8859-14</code>"
<tr><td>"<code>iso8859-14</code>"
<tr><td>"<code>iso885914</code>"
<tr>
<td rowspan=6><a>ISO-8859-15</a>
<td>"<code>csisolatin9</code>"
<tr><td>"<code>iso-8859-15</code>"
<tr><td>"<code>iso8859-15</code>"
<tr><td>"<code>iso885915</code>"
<tr><td>"<code>iso_8859-15</code>"
<tr><td>"<code>l9</code>"
<tr>
<td><a>ISO-8859-16</a>
<td>"<code>iso-8859-16</code>"
<tr>
<td rowspan=5><a>KOI8-R</a>
<td>"<code>cskoi8r</code>"
<tr><td>"<code>koi</code>"
<tr><td>"<code>koi8</code>"
<tr><td>"<code>koi8-r</code>"
<tr><td>"<code>koi8_r</code>"
<tr>
<td rowspan=2><a>KOI8-U</a>
<td>"<code>koi8-ru</code>"
<tr><td>"<code>koi8-u</code>"
<tr>
<td rowspan=4><a>macintosh</a>
<td>"<code>csmacintosh</code>"
<tr><td>"<code>mac</code>"
<tr><td>"<code>macintosh</code>"
<tr><td>"<code>x-mac-roman</code>"
<tr>
<td rowspan=6><a>windows-874</a>
<td>"<code>dos-874</code>"
<tr><td>"<code>iso-8859-11</code>"
<tr><td>"<code>iso8859-11</code>"
<tr><td>"<code>iso885911</code>"
<tr><td>"<code>tis-620</code>"
<tr><td>"<code>windows-874</code>"
<tr>
<td rowspan=3><a>windows-1250</a>
<td>"<code>cp1250</code>"
<tr><td>"<code>windows-1250</code>"
<tr><td>"<code>x-cp1250</code>"
<tr>
<td rowspan=3><a>windows-1251</a>
<td>"<code>cp1251</code>"
<tr><td>"<code>windows-1251</code>"
<tr><td>"<code>x-cp1251</code>"
<tr>
<td rowspan=17><a>windows-1252</a>
<td>"<code>ansi_x3.4-1968</code>"
<tr><td>"<code>ascii</code>"
<tr><td>"<code>cp1252</code>"
<tr><td>"<code>cp819</code>"
<tr><td>"<code>csisolatin1</code>"
<tr><td>"<code>ibm819</code>"
<tr><td>"<code>iso-8859-1</code>"
<tr><td>"<code>iso-ir-100</code>"
<tr><td>"<code>iso8859-1</code>"
<tr><td>"<code>iso88591</code>"
<tr><td>"<code>iso_8859-1</code>"
<tr><td>"<code>iso_8859-1:1987</code>"
<tr><td>"<code>l1</code>"
<tr><td>"<code>latin1</code>"
<tr><td>"<code>us-ascii</code>"
<tr><td>"<code>windows-1252</code>"
<tr><td>"<code>x-cp1252</code>"
<tr>
<td rowspan=3><a>windows-1253</a>
<td>"<code>cp1253</code>"
<tr><td>"<code>windows-1253</code>"
<tr><td>"<code>x-cp1253</code>"
<tr>
<td rowspan=12><a>windows-1254</a>
<td>"<code>cp1254</code>"
<tr><td>"<code>csisolatin5</code>"
<tr><td>"<code>iso-8859-9</code>"
<tr><td>"<code>iso-ir-148</code>"
<tr><td>"<code>iso8859-9</code>"
<tr><td>"<code>iso88599</code>"
<tr><td>"<code>iso_8859-9</code>"
<tr><td>"<code>iso_8859-9:1989</code>"
<tr><td>"<code>l5</code>"
<tr><td>"<code>latin5</code>"
<tr><td>"<code>windows-1254</code>"
<tr><td>"<code>x-cp1254</code>"
<tr>
<td rowspan=3><a>windows-1255</a>
<td>"<code>cp1255</code>"
<tr><td>"<code>windows-1255</code>"
<tr><td>"<code>x-cp1255</code>"
<tr>
<td rowspan=3><a>windows-1256</a>
<td>"<code>cp1256</code>"
<tr><td>"<code>windows-1256</code>"
<tr><td>"<code>x-cp1256</code>"
<tr>
<td rowspan=3><a>windows-1257</a>
<td>"<code>cp1257</code>"
<tr><td>"<code>windows-1257</code>"
<tr><td>"<code>x-cp1257</code>"
<tr>
<td rowspan=3><a>windows-1258</a>
<td>"<code>cp1258</code>"
<tr><td>"<code>windows-1258</code>"
<tr><td>"<code>x-cp1258</code>"
<tr>
<td rowspan=2><a>x-mac-cyrillic</a>
<td>"<code>x-mac-cyrillic</code>"
<tr><td>"<code>x-mac-ukrainian</code>"
<tbody>
<tr><th colspan=2><a href=#legacy-multi-byte-chinese-(simplified)-encodings>Legacy multi-byte Chinese (simplified) encodings</a>
<tr>
<td rowspan=9><a>GBK</a>
<td>"<code>chinese</code>"
<tr><td>"<code>csgb2312</code>"
<tr><td>"<code>csiso58gb231280</code>"
<tr><td>"<code>gb2312</code>"
<tr><td>"<code>gb_2312</code>"
<tr><td>"<code>gb_2312-80</code>"
<tr><td>"<code>gbk</code>"
<tr><td>"<code>iso-ir-58</code>"
<tr><td>"<code>x-gbk</code>"
<tr>
<td><a>gb18030</a>
<td>"<code>gb18030</code>"
<tbody>
<tr><th colspan=2><a href=#legacy-multi-byte-chinese-(traditional)-encodings>Legacy multi-byte Chinese (traditional) encodings</a>
<tr>
<td rowspan=5><a>Big5</a>
<td>"<code>big5</code>"
<tr><td>"<code>big5-hkscs</code>"
<tr><td>"<code>cn-big5</code>"
<tr><td>"<code>csbig5</code>"
<tr><td>"<code>x-x-big5</code>"
<tbody>
<tr><th colspan=2><a href=#legacy-multi-byte-japanese-encodings>Legacy multi-byte Japanese encodings</a>
<tr>
<td rowspan=3><a>EUC-JP</a>
<td>"<code>cseucpkdfmtjapanese</code>"
<tr><td>"<code>euc-jp</code>"
<tr><td>"<code>x-euc-jp</code>"
<tr>
<td rowspan=2><a>ISO-2022-JP</a>
<td>"<code>csiso2022jp</code>"
<tr><td>"<code>iso-2022-jp</code>"
<tr>
<td rowspan=8><a>Shift_JIS</a>
<td>"<code>csshiftjis</code>"
<tr><td>"<code>ms932</code>"
<tr><td>"<code>ms_kanji</code>"
<tr><td>"<code>shift-jis</code>"
<tr><td>"<code>shift_jis</code>"
<tr><td>"<code>sjis</code>"
<tr><td>"<code>windows-31j</code>"
<tr><td>"<code>x-sjis</code>"
<tbody>
<tr><th colspan=2><a href=#legacy-multi-byte-korean-encodings>Legacy multi-byte Korean encodings</a>
<tr>
<td rowspan=10><a>EUC-KR</a>
<td>"<code>cseuckr</code>"
<tr><td>"<code>csksc56011987</code>"
<tr><td>"<code>euc-kr</code>"
<tr><td>"<code>iso-ir-149</code>"
<tr><td>"<code>korean</code>"
<tr><td>"<code>ks_c_5601-1987</code>"
<tr><td>"<code>ks_c_5601-1989</code>"
<tr><td>"<code>ksc5601</code>"
<tr><td>"<code>ksc_5601</code>"
<tr><td>"<code>windows-949</code>"
<tbody>
<tr><th colspan=2><a href=#legacy-miscellaneous-encodings>Legacy miscellaneous encodings</a>
<tr>
<td rowspan=6><a>replacement</a>
<td>"<code>csiso2022kr</code>"
<tr><td>"<code>hz-gb-2312</code>"
<tr><td>"<code>iso-2022-cn</code>"
<tr><td>"<code>iso-2022-cn-ext</code>"
<tr><td>"<code>iso-2022-kr</code>"
<tr><td>"<code>replacement</code>"
<tr>
<td><a>UTF-16BE</a>
<td>"<code>utf-16be</code>"
<tr>
<td rowspan=2><a>UTF-16LE</a>
<td>"<code>utf-16</code>"
<tr><td>"<code>utf-16le</code>"
<tr>
<td><a>x-user-defined</a>
<td>"<code>x-user-defined</code>"
</table>
<p class=note>All <a for=/>encodings</a> and their
<a>labels</a> are also available as a non-normative
<a href=encodings.json>encodings.json</a> resource.
<h3 id=output-encodings>Output encodings</h3>
<p>To <dfn export>get an output encoding</dfn> from an <a for=/>encoding</a>
<var>encoding</var>, run these steps:
<ol>
<li><p>If <var>encoding</var> is <a>replacement</a>, <a>UTF-16BE</a>, or
<a>UTF-16LE</a>, return <a>UTF-8</a>.
<li><p>Return <var>encoding</var>.
</ol>
<p class=note>The <a>get an output encoding</a> algorithm is useful for URL parsing and HTML
form submission, which both need exactly this.
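<p class=note>As a sketch, with encodings represented by their names as plain strings (an
assumption made for this example), <a>get an output encoding</a> amounts to:

```python
def get_an_output_encoding(encoding):
    # These three encodings have no encoder, so UTF-8 is used for output.
    if encoding in ("replacement", "UTF-16BE", "UTF-16LE"):
        return "UTF-8"
    return encoding


print(get_an_output_encoding("UTF-16LE"))      # UTF-8
print(get_an_output_encoding("windows-1252"))  # windows-1252
```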
<h2 id=indexes>Indexes</h2>
<p>Most legacy <a for=/>encodings</a> make use of an <dfn id=index>index</dfn>. An
<a>index</a> is an ordered list of entries, each entry consisting of a pointer and a
corresponding code point. Within an <a>index</a> pointers are unique and code points can be
duplicated.
<p class="note no-backref">An efficient implementation likely has two
<a lt=index>indexes</a> per <a for=/>encoding</a>. One optimized for its
<a for=/>decoder</a> and one for its <a for=/>encoder</a>.
<p>To find the pointers and their corresponding code points in an <a>index</a>,
let <var>lines</var> be the result of splitting the resource's contents on U+000A.
Then remove each item in <var>lines</var> that is the empty string or starts with U+0023.
Then the pointers and their corresponding code points are found by splitting each item in <var>lines</var> on U+0009.
The first subitem is the pointer (as a decimal number) and the second is the corresponding code point (as a hexadecimal number).
Other subitems are not relevant.
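<p class=note>The parsing steps above can be sketched as follows. The function name and the sample
contents are made up for illustration; the sample is not taken from any index resource of this
specification.

```python
def parse_index(contents):
    """Return a list of (pointer, code point) entries in resource order."""
    entries = []
    for line in contents.split("\n"):
        # Skip empty lines and lines starting with U+0023 (#).
        if line == "" or line.startswith("#"):
            continue
        subitems = line.split("\t")
        pointer = int(subitems[0])         # decimal number
        code_point = int(subitems[1], 16)  # hexadecimal number, "0x" prefix allowed
        entries.append((pointer, code_point))  # other subitems are ignored
    return entries


# Illustrative contents only, not real index data.
SAMPLE = "# Identifier: example\n\n0\t0x0041\tA\n1\t0x0042\tB\n"

print(parse_index(SAMPLE))  # [(0, 65), (1, 66)]
```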
<p class="note no-backref">To signify changes an <a>index</a> includes an
<i>Identifier</i> and a <i>Date</i>. If an <i>Identifier</i> has
changed, so has the <a>index</a>.
<p>The <dfn>index code point</dfn> for <var>pointer</var> in
<var>index</var> is the code point corresponding to
<var>pointer</var> in <var>index</var>, or null if
<var>pointer</var> is not in <var>index</var>.
<p>The <dfn>index pointer</dfn> for <var>code point</var> in
<var>index</var> is the <em>first</em> pointer corresponding to
<var>code point</var> in <var>index</var>, or null if
<var>code point</var> is not in <var>index</var>.
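<p class=note>The two lookups can be sketched with a linear scan over an index represented as an
ordered list of (pointer, code point) pairs. That representation and the sample data are
assumptions made for this example; as noted earlier, an efficient implementation would likely use
separate structures for decoding and encoding.

```python
def index_code_point(pointer, index):
    # Pointers are unique, so at most one entry can match.
    for entry_pointer, code_point in index:
        if entry_pointer == pointer:
            return code_point
    return None  # stands in for null


def index_pointer(code_point, index):
    # Code points can be duplicated; the first matching pointer wins.
    for pointer, entry_code_point in index:
        if entry_code_point == code_point:
            return pointer
    return None  # stands in for null


# Illustrative index in which U+0041 appears at two pointers.
SAMPLE_INDEX = [(0, 0x41), (1, 0x42), (5, 0x41)]

print(index_code_point(5, SAMPLE_INDEX))  # 65
print(index_pointer(0x41, SAMPLE_INDEX))  # 0
print(index_pointer(0x43, SAMPLE_INDEX))  # None
```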
<div class=note id=visualization>
<p>There is a non-normative visualization for each <a>index</a> other than
<a>index gb18030 ranges</a> and <a>index ISO-2022-JP katakana</a>. <a>index jis0208</a> also has an
alternative <a>Shift_JIS</a> visualization. Additionally, there is a visualization of the Basic
Multilingual Plane coverage of each index other than <a>index gb18030 ranges</a> and
<a>index ISO-2022-JP katakana</a>.
<p>The legend for the visualizations is:
<ul class=visualizationlegend>
<li class=unmapped>Unmapped
<li class=mid>Two bytes in UTF-8
<li class="mid contiguous">Two bytes in UTF-8, code point follows immediately the code point of
previous pointer
<li class=upper>Three bytes in UTF-8 (non-PUA)
<li class="upper contiguous">Three bytes in UTF-8 (non-PUA), code point follows immediately the
code point of previous pointer
<li class=pua>Private Use
<li class="pua contiguous">Private Use, code point follows immediately the code point of previous
pointer
<li class=astral>Four bytes in UTF-8
<li class="astral contiguous">Four bytes in UTF-8, code point follows immediately the code point
of previous pointer
<li class=duplicate>Duplicate code point already mapped at an earlier index
<li class=compatibility>CJK Compatibility Ideograph
<li class=ext>CJK Unified Ideographs Extension A
</ul>
</div>
<p>These are the <a lt=index>indexes</a> defined by this
specification, excluding <a>index single-byte</a>, which have their own table:
<table>
<tbody><tr><th colspan=4><a>Index</a><th>Notes
<tr>
<td><dfn export>index Big5</dfn>
<td><a href=index-big5.txt>index-big5.txt</a>
<td><a href=big5.html>index Big5 visualization</a>
<td><a href=big5-bmp.html>index Big5 BMP coverage</a>
<td>This matches the Big5 standard in combination with the
Hong Kong Supplementary Character Set and other common extensions.
<tr>
<td><dfn export>index EUC-KR</dfn>
<td><a href=index-euc-kr.txt>index-euc-kr.txt</a>
<td><a href=euc-kr.html>index EUC-KR visualization</a>
<td><a href=euc-kr-bmp.html>index EUC-KR BMP coverage</a>
<td>This matches the KS X 1001 standard and the Unified Hangul Code, more commonly known together
as Windows Codepage 949. It covers the Hangul Syllables block of Unicode in its entirety. The
Hangul block whose top left corner in the visualization is at pointer 9026 is in the Unicode
order. Taken separately, the rest of the Hangul syllables in this index are in the Unicode order,
too.
<tr>
<td><dfn export>index gb18030</dfn>
<td><a href=index-gb18030.txt>index-gb18030.txt</a>
<td><a href=gb18030.html>index gb18030 visualization</a>
<td><a href=gb18030-bmp.html>index gb18030 BMP coverage</a>
<td>This matches the GB18030-2005 standard for code points encoded as two bytes, except for
0xA3 0xA0 which maps to U+3000 to be compatible with deployed content. This index covers the
CJK Unified Ideographs block of Unicode in its entirety. Entries from that block that are above or
to the left of (the first) U+3000 in the visualization are in the Unicode order.
<!-- https://bugzilla.mozilla.org/show_bug.cgi?id=131837
https://bugs.webkit.org/show_bug.cgi?id=17014
https://www.w3.org/Bugs/Public/show_bug.cgi?id=25396
https://github.com/whatwg/encoding/issues/17 -->
<tr>
<td><dfn export>index gb18030 ranges</dfn>
<td colspan=3><a href=index-gb18030-ranges.txt>index-gb18030-ranges.txt</a>
<td>This <a>index</a> works differently from all others. Listing all code points would result
in over a million items whereas they can be represented neatly in 207 ranges combined with trivial
limit checks. It therefore only superficially matches the GB18030-2005 standard for code points
encoded as four bytes. See also <a>index gb18030 ranges code point</a> and
<a>index gb18030 ranges pointer</a> below.
<tr>
<td><dfn export>index jis0208</dfn>
<td><a href=index-jis0208.txt>index-jis0208.txt</a>
<td><a href=jis0208.html>index jis0208 visualization</a>, <a href=shift_jis.html>Shift_JIS visualization</a>
<td><a href=jis0208-bmp.html>index jis0208 BMP coverage</a>
<td>This is the JIS X 0208 standard including formerly proprietary
extensions from IBM and NEC.
<!-- NEC = Nippon Electric Company -->
<tr>
<td><dfn export>index jis0212</dfn>
<td><a href=index-jis0212.txt>index-jis0212.txt</a>
<td><a href=jis0212.html>index jis0212 visualization</a>
<td><a href=jis0212-bmp.html>index jis0212 BMP coverage</a>
<td>This is the JIS X 0212 standard. It is only used by the <a>EUC-JP decoder</a>
due to lack of widespread support elsewhere.
<!--
No JIS X 0212 EUC-JP encoder support:
https://bugzilla.mozilla.org/show_bug.cgi?id=600715
https://code.google.com/p/chromium/issues/detail?id=78847
No JIS X 0212 ISO-2022-JP support:
https://www.w3.org/Bugs/Public/show_bug.cgi?id=26885
-->
<tr>
<td><dfn export>index ISO-2022-JP katakana</dfn>
<td colspan=3><a href=index-iso-2022-jp-katakana.txt>index-iso-2022-jp-katakana.txt</a>
<td>This maps halfwidth to fullwidth katakana as per Unicode Normalization Form KC, except that
U+FF9E and U+FF9F map to U+309B and U+309C rather than U+3099 and U+309A. It is only used by the
<a>ISO-2022-JP encoder</a>. [[UNICODE]]
</table>
<p>The <dfn>index gb18030 ranges code point</dfn> for <var>pointer</var> is
the return value of these steps:
<ol>
<li><p>If <var>pointer</var> is greater than 39419 and less than
189000, or <var>pointer</var> is greater than 1237575, return null.
<li><p>If <var>pointer</var> is 7457, return code point U+E7C7.
<!-- 7457 is 0x81 0x35 0xF4 0x37 -->
<li><p>Let <var>offset</var> be the last pointer in <a>index gb18030 ranges</a> that is less than
or equal to <var>pointer</var> and let <var>code point offset</var> be its corresponding code
point.
<li><p>Return a code point whose value is
<var>code point offset</var> + <var>pointer</var> − <var>offset</var>.
</ol>
<p>The <dfn>index gb18030 ranges pointer</dfn> for <var>code point</var> is
the return value of these steps:
<ol>
<li><p>If <var>code point</var> is U+E7C7, return pointer 7457.
<li><p>Let <var>offset</var> be the last code point in <a>index gb18030 ranges</a> that is less
than or equal to <var>code point</var> and let <var>pointer offset</var> be its corresponding
pointer.
<li><p>Return a pointer whose value is
<var>pointer offset</var> + <var>code point</var> − <var>offset</var>.
</ol>
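<div class=example>
<p>The two range algorithms above can be sketched in JavaScript as follows. This is a non-normative
illustration only. The function names and the <code>RANGES_EXCERPT</code> table are assumptions made
for this sketch; the table holds just four of the 207 entries of
<a>index gb18030 ranges</a>, so results are only meaningful for pointers 0 to 44 and 189000 to
1237575 (and the corresponding code points).

```javascript
// ASSUMPTION: truncated excerpt of index-gb18030-ranges.txt as
// [pointer, code point] pairs, sorted ascending. A real implementation
// would load all 207 ranges from the index file.
const RANGES_EXCERPT = [
  [0, 0x0080],
  [36, 0x00A5],
  [38, 0x00A9],
  [189000, 0x10000],
];

// Hypothetical helper mirroring "index gb18030 ranges code point".
function gb18030RangesCodePoint(pointer, ranges = RANGES_EXCERPT) {
  // Limit checks: the gap between the BMP ranges and the supplementary
  // planes, and anything past U+10FFFF.
  if ((pointer > 39419 && pointer < 189000) || pointer > 1237575) return null;
  // Special case: 0x81 0x35 0xF4 0x37.
  if (pointer === 7457) return 0xE7C7;
  let offset = 0;
  let codePointOffset = 0;
  for (const [rangePointer, rangeCodePoint] of ranges) {
    if (rangePointer > pointer) break;
    offset = rangePointer;
    codePointOffset = rangeCodePoint;
  }
  return codePointOffset + pointer - offset;
}

// Hypothetical helper mirroring "index gb18030 ranges pointer".
// Callers are assumed to pass only code points from U+0080 upward that
// are not covered by the two-byte index, as the steps do no validation.
function gb18030RangesPointer(codePoint, ranges = RANGES_EXCERPT) {
  if (codePoint === 0xE7C7) return 7457;
  let offset = 0;
  let pointerOffset = 0;
  for (const [rangePointer, rangeCodePoint] of ranges) {
    if (rangeCodePoint > codePoint) break;
    offset = rangeCodePoint;
    pointerOffset = rangePointer;
  }
  return pointerOffset + codePoint - offset;
}
```

<p>Within the covered ranges the two helpers are inverses of each other: pointer 37 yields U+00A6
and U+00A6 yields pointer 37.
</div>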
<p>The <dfn>index Shift_JIS pointer</dfn> for <var>code point</var> is the return value of these
steps:
<ol>
<li>
<p>Let <var>index</var> be <a>index jis0208</a> excluding all entries whose pointer is in
the range 8272 to 8835, inclusive.
<!-- selected NEC duplicates from IBM extensions later in the index; need to use IBM
extensions when going back to bytes -->
<p class=note>The <a>index jis0208</a> contains duplicate code points so the exclusion of
these entries causes later code points to be used.
<li><p>Return the <a>index pointer</a> for <var>code point</var> in
<var>index</var>.
</ol>
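<div class=example>
<p>A minimal, non-normative JavaScript sketch of the steps above. The helper names are hypothetical,
and the index is assumed to be available as an array of <code>[pointer, code point]</code> pairs
sorted by pointer. The two-entry excerpt is believed to match <code>index-jis0208.txt</code>
(U+2170 appearing once among the NEC-selected duplicates and once among the IBM extensions), but
the index file remains the authority.

```javascript
// Hypothetical helper: "index pointer" is the first pointer that maps to
// the code point, or null if there is none.
function indexPointer(codePoint, entries) {
  for (const [pointer, value] of entries) {
    if (value === codePoint) return pointer;
  }
  return null;
}

// Hypothetical helper mirroring "index Shift_JIS pointer".
function indexShiftJISPointer(codePoint, jis0208Entries) {
  // Drop pointers 8272 to 8835, inclusive, so that a later duplicate wins.
  const filtered = jis0208Entries.filter(
    ([pointer]) => pointer < 8272 || pointer > 8835
  );
  return indexPointer(codePoint, filtered);
}

// ASSUMPTION: illustrative excerpt, not the full index.
const JIS0208_EXCERPT = [
  [8634, 0x2170],  // inside the excluded range
  [10716, 0x2170], // later duplicate
];
```

<p>With this excerpt, a plain first-match lookup for U+2170 gives pointer 8634, whereas the
Shift_JIS-specific lookup gives pointer 10716.
</div>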
<p>The <dfn>index Big5 pointer</dfn> for <var>code point</var> is the return value of
these steps:
<ol>
<li>
<p>Let <var>index</var> be <a>index Big5</a> excluding all entries whose pointer is less
than (0xA1 - 0x81) × 157.
<p class=note>Avoid returning Hong Kong Supplementary Character Set extensions literally.
<li>
<p>If <var>code point</var> is U+2550, U+255E, U+2561, U+256A, U+5341, or U+5345,
return the <em>last</em> pointer corresponding to <var>code point</var> in
<var>index</var>.
<!-- https://www.w3.org/Bugs/Public/show_bug.cgi?id=27878 -->
<p class=note>There are other duplicate code points, but for those the <em>first</em> pointer is
to be used.
<li><p>Return the <a>index pointer</a> for <var>code point</var> in
<var>index</var>.
</ol>
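<div class=example>
<p>The steps above might be implemented along these lines. This is a non-normative JavaScript
sketch with hypothetical names, assuming the index is an array of
<code>[pointer, code point]</code> pairs sorted by pointer. In the excerpt, the two U+5341 pointers
are derived from the Big5 byte pairs 0xA2 0xCC and 0xA4 0x51; the remaining entries are invented
purely to exercise the other branches.

```javascript
// Code points for which the last matching pointer is used.
const LAST_POINTER_CODE_POINTS = new Set([
  0x2550, 0x255E, 0x2561, 0x256A, 0x5341, 0x5345,
]);

// Hypothetical helper mirroring "index Big5 pointer".
function indexBig5Pointer(codePoint, big5Entries) {
  const cutoff = (0xA1 - 0x81) * 157; // 5024; entries below it are skipped
  const useLast = LAST_POINTER_CODE_POINTS.has(codePoint);
  let result = null;
  for (const [pointer, value] of big5Entries) {
    if (pointer < cutoff || value !== codePoint) continue;
    if (!useLast) return pointer; // default: first matching pointer
    result = pointer;             // listed code points: keep the last one
  }
  return result;
}

// ASSUMPTION: illustrative excerpt, not the full index.
const BIG5_EXCERPT = [
  [1000, 0xF0000], // invented entry below the cutoff
  [5287, 0x5341],  // 0xA2 0xCC
  [5512, 0x5341],  // 0xA4 0x51
  [6000, 0xF0001], // invented duplicate pair
  [7000, 0xF0001],
];
```

<p>For this excerpt, U+5341 resolves to pointer 5512 (the last one), the invented duplicate
resolves to pointer 6000 (the first one), and the entry below the cutoff is never returned.
</div>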
<hr>
<p class="note no-backref">All <a lt=index>indexes</a> are also available as a non-normative
<a href=indexes.json>indexes.json</a> resource. (<a>Index gb18030 ranges</a> has a slightly
different format here, to be able to represent ranges.)
<h2 id=specification-hooks>Hooks for standards</h2>
<div class=note>
<p>The algorithms defined below (<a>decode</a>, <a>UTF-8 decode</a>,
<a>UTF-8 decode without BOM</a>, <a>UTF-8 decode without BOM or fail</a>, <a for=/>encode</a>,
<a>UTF-8 encode</a>, and <a>BOM sniff</a>) are intended for usage by other standards.
<p>For decoding, <a>UTF-8 decode</a> is to be used by new formats. For identifiers or byte
sequences within a format or protocol, use <a>UTF-8 decode without BOM</a> or
<a>UTF-8 decode without BOM or fail</a>.
<p>For encoding, <a>UTF-8 encode</a> is to be used.
<p>Standards are strongly discouraged from using <a>decode</a>, <a for=/>encode</a>, and
<a>BOM sniff</a>, except as needed for compatibility.
<p>The <a>get an encoding</a> algorithm is to be used to turn a <a>label</a> into an
<a for=/>encoding</a>.
<p>Standards are to ensure that the streams they pass to the <a for=/>encode</a> and
<a>UTF-8 encode</a> algorithms are effectively scalar value streams, i.e., they contain no
<a>surrogates</a>.
</div>
<p>To <dfn export>decode</dfn> a byte stream <var>stream</var> using
fallback encoding <var>encoding</var>, run these steps:
<ol>
<li><p>Let <var>BOMEncoding</var> be the result of <a>BOM sniffing</a> <var>stream</var>.
<li>
<p>If <var>BOMEncoding</var> is non-null:
<ol>
<li><p>Set <var>encoding</var> to <var>BOMEncoding</var>.
<li><p><a>Read</a> three bytes from <var>stream</var>, if <var>BOMEncoding</var> is <a>UTF-8</a>;
otherwise <a>read</a> two bytes. (Do nothing with those bytes.)
</ol>
<p class=note>For compatibility with deployed content, the byte order mark is more authoritative
than anything else. In a context where HTTP is used this is in violation of the semantics of the
`<code>Content-Type</code>` header.
<li><p>Let <var>output</var> be a scalar value <a for=/>stream</a>.
<li><p><a>Run</a> <var>encoding</var>'s
<a for=/>decoder</a> with <var>stream</var> and <var>output</var>.
<li><p>Return <var>output</var>.
</ol>
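<div class=example>
<p>As a rough, non-normative illustration of how a byte order mark overrides the fallback encoding,
the steps above can be approximated in JavaScript on top of <code>TextDecoder</code>. The function
names are hypothetical, the whole input is assumed to be available as a single
<code>Uint8Array</code> rather than a stream, and <var>fallbackLabel</var> is assumed to be a label
the runtime supports.

```javascript
// Hypothetical helper: returns the encoding indicated by a leading byte
// order mark, or null. It does not consume any bytes.
function bomSniff(bytes) {
  if (bytes.length >= 3 && bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) {
    return "UTF-8";
  }
  if (bytes.length >= 2 && bytes[0] === 0xFE && bytes[1] === 0xFF) return "UTF-16BE";
  if (bytes.length >= 2 && bytes[0] === 0xFF && bytes[1] === 0xFE) return "UTF-16LE";
  return null;
}

// Hypothetical helper approximating the "decode" hook.
function decode(bytes, fallbackLabel) {
  let label = fallbackLabel;
  let start = 0;
  const bomEncoding = bomSniff(bytes);
  if (bomEncoding !== null) {
    // The byte order mark wins over the caller's fallback encoding.
    label = bomEncoding;
    start = bomEncoding === "UTF-8" ? 3 : 2;
  }
  // The BOM was skipped above, so ask TextDecoder not to handle it again.
  return new TextDecoder(label, { ignoreBOM: true }).decode(bytes.subarray(start));
}
```

<p>For instance, <code>decode(Uint8Array.of(0xEF, 0xBB, 0xBF, 0x68, 0x69), "utf-16le")</code>
decodes as UTF-8 despite the UTF-16LE fallback.
</div>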
<p>To <dfn export>UTF-8 decode</dfn> a byte stream <var>stream</var>, run
these steps:
<ol>
<li><p>Let <var>buffer</var> be an empty byte sequence.
<li><p><a>Read</a> three bytes from <var>stream</var>
into <var>buffer</var>.
<li><p>If <var>buffer</var> does not match 0xEF 0xBB 0xBF,
<a>prepend</a> <var>buffer</var> to <var>stream</var>.
<li><p>Let <var>output</var> be a scalar value <a for=/>stream</a>.
<li><p><a>Run</a> <a>UTF-8</a>'s
<a for=/>decoder</a> with <var>stream</var> and <var>output</var>.
<li><p>Return <var>output</var>.
</ol>
<p>To <dfn export>UTF-8 decode without BOM</dfn> a byte stream <var>stream</var>, run these
steps:
<ol>
<li><p>Let <var>output</var> be a scalar value <a for=/>stream</a>.
<li><p><a>Run</a> <a>UTF-8</a>'s
<a for=/>decoder</a> with <var>stream</var> and <var>output</var>.
<li><p>Return <var>output</var>.
</ol>
<p>To <dfn export>UTF-8 decode without BOM or fail</dfn> a byte stream <var>stream</var>, run these
steps:
<!-- Needed by https://tools.ietf.org/html/rfc6455#section-8.1 and
https://webassembly.github.io/spec/js-api/#dom-module-customsections-moduleobject-sectionname
-->
<ol>
<li><p>Let <var>output</var> be a scalar value stream.
<li><p>Let <var>potentialError</var> be the result of <a>running</a>
<a>UTF-8</a>'s <a for=/>decoder</a> with <var>stream</var>, <var>output</var>, and
"<code>fatal</code>".
<li><p>If <var>potentialError</var> is <a>error</a>, return failure.
<li><p>Return <var>output</var>.
</ol>
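<div class=example>
<p>The difference between the three UTF-8 decoding hooks above can be observed from JavaScript via
<code>TextDecoder</code> options. This non-normative sketch uses hypothetical wrapper names,
assumes the input is a complete <code>Uint8Array</code>, and represents failure as
<code>null</code>.

```javascript
// Approximates "UTF-8 decode": a leading BOM is stripped and errors
// become U+FFFD.
function utf8Decode(bytes) {
  return new TextDecoder("utf-8").decode(bytes);
}

// Approximates "UTF-8 decode without BOM": a leading BOM is kept as
// U+FEFF and errors become U+FFFD.
function utf8DecodeWithoutBOM(bytes) {
  return new TextDecoder("utf-8", { ignoreBOM: true }).decode(bytes);
}

// Approximates "UTF-8 decode without BOM or fail": a leading BOM is kept
// and any error yields failure (null in this sketch).
function utf8DecodeWithoutBOMOrFail(bytes) {
  try {
    return new TextDecoder("utf-8", { ignoreBOM: true, fatal: true }).decode(bytes);
  } catch {
    return null;
  }
}
```

<p>Given the bytes 0xEF 0xBB 0xBF 0x61, the first wrapper returns "<code>a</code>" while the other
two return U+FEFF followed by "<code>a</code>". Given 0x61 0xFF, the first two return
"<code>a</code>" followed by U+FFFD and the third reports failure.
</div>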
<hr>
<p>To <dfn export>encode</dfn> a scalar value stream <var>stream</var> using encoding
<var>encoding</var>, run these steps:
<ol>
<li><p>Assert: <var>encoding</var> is not <a>replacement</a>, <a>UTF-16BE</a>, or
<a>UTF-16LE</a>.
<li><p>Let <var>output</var> be a byte <a for=/>stream</a>.
<li><p><a>Run</a> <var>encoding</var>'s
<a for=/>encoder</a> with <var>stream</var>, <var>output</var>, and "<code>html</code>".
<li><p>Return <var>output</var>.
</ol>
<p class="note no-backref">This is mostly a legacy hook for URLs and HTML forms. Layering
<a>UTF-8 encode</a> on top is safe as it never triggers
<a>errors</a>.
[[URL]]
[[HTML]]
<p>To <dfn export>UTF-8 encode</dfn> a scalar value stream <var>stream</var>, return the result of
<a lt=encode for=/>encoding</a> <var>stream</var> using encoding <a>UTF-8</a>.
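<div class=example>
<p>From JavaScript, the closest counterpart to <a>UTF-8 encode</a> is <code>TextEncoder</code>.
The non-normative sketch below uses a hypothetical wrapper name. It also shows why callers need to
supply scalar values: a JavaScript string can hold a lone surrogate, which
<code>TextEncoder</code> replaces with U+FFFD before encoding.

```javascript
// Hypothetical wrapper approximating "UTF-8 encode" for a JavaScript string.
function utf8Encode(string) {
  return new TextEncoder().encode(string);
}

// U+20AC encodes as three bytes.
const euro = Array.from(utf8Encode("\u20AC"));

// A lone surrogate is not a scalar value; it is converted to U+FFFD
// (0xEF 0xBF 0xBD) before the encoder sees it.
const loneSurrogate = Array.from(utf8Encode("\uD800"));
```
</div>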
<hr>
<p>To <dfn export>BOM sniff</dfn> a byte stream <var>stream</var>, run these steps:
<ol>
<li><p>Wait until <var>stream</var> has three bytes available or the <a>end-of-stream</a> has been
reached, whichever comes first.
<li>
<p>For each of the rows in the table below, starting with the first one and going down, if
<var>stream</var> <a for="byte sequence">starts with</a> the bytes given in the first column,
return the <a for=/>encoding</a> given in the cell in the second column of that row. (Do not
consume those bytes.)
<table>
<tbody><tr><th>Byte order mark<th>Encoding
<tr><td>0xEF 0xBB 0xBF<td><a>UTF-8</a>
<tr><td>0xFE 0xFF<td><a>UTF-16BE</a>
<tr><td>0xFF 0xFE<td><a>UTF-16LE</a>
</table>
<li><p>Return null.
</ol>
<p class=note>This hook is a workaround for the fact that <a>decode</a> has no way to communicate