<pre class=metadata>
Group: WHATWG
H1: Encoding
Shortname: encoding
Text Macro: TWITTER encodings
Text Macro: LATESTRD 2024-12
Abstract: The Encoding Standard defines encodings and their JavaScript API.
Translation: ja https://triple-underscore.github.io/Encoding-ja.html
Markup Shorthands: css off
Translate IDs: dictdef-textdecoderoptions textdecoderoptions,dictdef-textdecodeoptions textdecodeoptions,index section-index
</pre>
<link rel=stylesheet href=visualization-colors.css>
<h2 id=preface>Preface</h2>
<p>The UTF-8 encoding is the most appropriate encoding for interchange of Unicode, the
universal coded character set. Therefore, for new protocols and formats, as well as
existing formats deployed in new contexts, this specification requires (and defines) the
UTF-8 encoding.
<p>The other (legacy) encodings have been defined to some extent in the past. However,
user agents have not always implemented them in the same way, have not always used the
same labels, and often differ in dealing with undefined and former proprietary areas of
encodings. This specification addresses those gaps so that new user agents do not have to
reverse engineer encoding implementations and existing user agents can converge.
<p>In particular, this specification defines all those encodings, their algorithms to go
from bytes to scalar values and back, and their canonical names and identifying labels.
This specification also defines an API to expose part of the encoding algorithms to
JavaScript.
<p>User agents have also significantly deviated from the labels listed in the
<a href=https://www.iana.org/assignments/character-sets/character-sets.xhtml>IANA Character Sets registry</a>.
To stop spreading legacy encodings further, this specification is exhaustive about the
aforementioned details and therefore has no need for the registry. In particular, this
specification does not provide a mechanism for extending any aspect of encodings.
<h2 id=security-background>Security background</h2>
<p>There is a set of encoding security issues when the producer and consumer do not agree on the
encoding in use, or on the way a given encoding is to be implemented. For instance, an attack was
reported in 2011 where a <a>Shift_JIS</a> lead byte 0x82 was used to “mask” a 0x22 trail byte in a
JSON resource of which an attacker could control some field. The producer did not see the problem
even though this is an illegal byte combination. The consumer decoded it as a single U+FFFD and
therefore changed the overall interpretation as U+0022 is an important delimiter. Decoders of
encodings that use multiple bytes for scalar values now require that in case of an illegal byte
combination, a scalar value in the range U+0000 to U+007F, inclusive, cannot be “masked”. For the
aforementioned sequence the output would be U+FFFD U+0022. (As an unfortunate exception to this, the
<a>gb18030 decoder</a> will “mask” up to one such byte at <a>end-of-queue</a>.)
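<p class=note>The non-masking behavior described above can be observed through the
<code>TextDecoder</code> API. The following is a minimal sketch, not a test of the reported attack:
it assumes a runtime that exposes <code>TextDecoder</code> as a global, and it uses the UTF-8
decoder instead of Shift_JIS (a substitution made here for portability) with an incomplete lead
byte 0xE2 followed by 0x22.

```javascript
// Assumption: UTF-8 stands in for Shift_JIS here; the byte values are illustrative.
// 0xE2 starts a three-byte UTF-8 sequence, 0x22 is U+0022 (") and is not a valid
// continuation byte.
const bytes = new Uint8Array([0xE2, 0x22]);

// The default error mode of TextDecoder is "replacement".
const decoded = new TextDecoder("utf-8").decode(bytes);

// The ASCII byte is not swallowed by the error: the result is U+FFFD followed by U+0022.
const codePoints = Array.from(decoded, (c) => c.codePointAt(0));
console.log(codePoints.map((cp) => "U+" + cp.toString(16).toUpperCase().padStart(4, "0")));
```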
<p>This is a larger issue for encodings that map anything that is an <a>ASCII byte</a> to something
that is not an <a>ASCII code point</a>, when there is no lead byte present. These are
“ASCII-incompatible” encodings and other than <a>ISO-2022-JP</a> and <a>UTF-16BE/LE</a>, which are
unfortunately required due to deployed content, they are not supported. (Investigation is
<a href=https://github.com/whatwg/encoding/issues/8 lt="Add more labels to the replacement encoding">ongoing</a>
whether more labels of other such encodings can be mapped to the <a>replacement</a> encoding, rather
than the unknown encoding fallback.) An example attack is injecting carefully crafted content into a
resource and then encouraging the user to override the encoding, resulting in, e.g., script
execution.
<p>Encoders used by URLs found in HTML and HTML's form feature can also result in slight information
loss when an encoding is used that cannot represent all scalar values. E.g., when a resource uses
the <a>windows-1252</a> encoding a server will not be able to distinguish between an end user
entering “💩” and “&#128169;” into a form.
<p>The problems outlined here go away when exclusively using UTF-8, which is one of the many reasons
that UTF-8 is now the mandatory encoding for all things.
<p class=note>See also the <a href=#browser-ui>Browser UI</a> chapter.
<h2 id=terminology>Terminology</h2>
<p>This specification depends on the Infra Standard. [[!INFRA]]
<p>Hexadecimal numbers are prefixed with "0x".
<p>In equations, all numbers are integers, addition is represented by "+", subtraction by "−",
multiplication by "×", integer division by "/" (returns the quotient), modulo by "%" (returns the
remainder of an integer division), logical left shifts by "<<", logical right shifts by ">>",
bitwise AND by "&", and bitwise OR by "|".
<p>For logical right shifts, operands must have at least twenty-one bits of precision.
<hr>
<p>An <dfn id=concept-stream export>I/O queue</dfn> is a type of <a for=/>list</a> with
<a for=list>items</a> of a particular type (i.e., <a>bytes</a> or <a>scalar values</a>).
<dfn id="end-of-stream" export>End-of-queue</dfn> is a special <a for=list>item</a> that can be
present in <a for=/>I/O queues</a> of any type and it signifies that there are no more
<a for=list>items</a> in the queue.
<div class=note>
<p>There are two ways to use an <a for=/>I/O queue</a>: in immediate mode, to represent I/O data
stored in memory, and in streaming mode, to represent data coming in from the network. Immediate
queues have <a>end-of-queue</a> as their last item, whereas streaming queues need not have it, and
so their <a for="I/O queue">read</a> operation might block.
<p>It is expected that streaming <a for=/>I/O queues</a> will be created empty, and that new
<a for=list>items</a> will be <a for="I/O queue">pushed</a> to them as data comes in from the
network. When the underlying network stream closes, an <a>end-of-queue</a> item is to be
<a for="I/O queue">pushed</a> into the queue.
<p>Since reading from a streaming <a for=/>I/O queue</a> might block, streaming
<a for=/>I/O queues</a> are not to be used from an <a for=/>event loop</a>. They are to be used
<a>in parallel</a> instead.
</div>
<p>To <dfn id=concept-stream-read for="I/O queue" export>read</dfn> an <a for=list>item</a> from an
<a for=/>I/O queue</a> <var>ioQueue</var>, run these steps:
<ol>
<li><p>If <var>ioQueue</var> <a for=list>is empty</a>, then wait until its <a for=list>size</a> is
at least 1.
<li><p>If <var>ioQueue</var>[0] is <a>end-of-queue</a>, then return <a>end-of-queue</a>.
<li><p><a for=list>Remove</a> <var>ioQueue</var>[0] and return it.
</ol>
<p>To <a for="I/O queue">read</a> a number <var>number</var> of <a for=list>items</a> from
<var>ioQueue</var>, run these steps:
<ol>
<li><p>Let <var>readItems</var> be « ».
<li>
<p>Perform the following step <var>number</var> times:
<ol>
<li><p><a for=list>Append</a> to <var>readItems</var> the result of
<a for="I/O queue">reading</a> an item from <var>ioQueue</var>.
</ol>
</li>
<li><p><a for=list>Remove</a> <a>end-of-queue</a> from <var>readItems</var>.
<li><p>Return <var>readItems</var>.
</ol>
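<p class=note>The two read operations can be sketched for an immediate-mode queue as follows. This
is an illustration only: the queue is modeled as a plain array, <code>END_OF_QUEUE</code>,
<code>readItem</code>, and <code>readItems</code> are hypothetical names, and the blocking behavior
of streaming queues is not modeled.

```javascript
// Hypothetical sentinel standing in for the end-of-queue item.
const END_OF_QUEUE = Symbol("end-of-queue");

// Read a single item. An immediate-mode queue always ends with END_OF_QUEUE,
// so the "wait until size is at least 1" step never has to block here.
function readItem(ioQueue) {
  if (ioQueue.length === 0) {
    throw new Error("immediate-mode queue is missing end-of-queue");
  }
  if (ioQueue[0] === END_OF_QUEUE) return END_OF_QUEUE; // end-of-queue is never removed
  return ioQueue.shift();
}

// Read `number` items, then drop any end-of-queue results from the returned list.
function readItems(ioQueue, number) {
  const items = [];
  for (let i = 0; i < number; i++) items.push(readItem(ioQueue));
  return items.filter((item) => item !== END_OF_QUEUE);
}

const queue = [0xF0, 0x9F, END_OF_QUEUE];
const firstThree = readItems(queue, 3); // only two bytes are available
```

<p class=note>Asking for more items than are available returns the shorter list, and the queue is
left holding only the end-of-queue item.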
<p>To <dfn for="I/O queue" export>peek</dfn> a number <var>number</var> of <a for=list>items</a>
from an <a for=/>I/O queue</a> <var>ioQueue</var>, run these steps:
<ol>
<li><p>Wait until either <var>ioQueue</var>'s <a for=list>size</a> is equal to or greater than
<var>number</var>, or <var>ioQueue</var> <a for=list>contains</a> <a>end-of-queue</a>, whichever
comes first.
<li><p>Let <var>prefix</var> be « ».
<li>
<p><a for=list>For each</a> <var>n</var> in <a>the range</a> 0 to <var>number</var> − 1, inclusive:
<ol>
<li><p>If <var>ioQueue</var>[<var>n</var>] is <a>end-of-queue</a>, <a>break</a>.
<li><p>Otherwise, <a for=list>append</a> <var>ioQueue</var>[<var>n</var>] to <var>prefix</var>.
</ol>
</li>
<li><p>Return <var>prefix</var>.
</ol>
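<p class=note>A sketch of peeking on an immediate-mode queue, under the assumptions that the queue
is a plain array, that positions are zero-based (so the first item is at index 0, as in the read
steps), and that <code>END_OF_QUEUE</code> and <code>peek</code> are hypothetical names. Waiting on
a streaming queue is not modeled.

```javascript
// Hypothetical sentinel standing in for the end-of-queue item.
const END_OF_QUEUE = Symbol("end-of-queue");

// Return up to `number` leading items without removing them from the queue.
function peek(ioQueue, number) {
  const prefix = [];
  for (let n = 0; n < number && n < ioQueue.length; n++) {
    if (ioQueue[n] === END_OF_QUEUE) break; // stop at end-of-queue, do not include it
    prefix.push(ioQueue[n]);
  }
  return prefix;
}

const queue = [0xEF, 0xBB, END_OF_QUEUE];
const prefix = peek(queue, 3); // fewer than three bytes are available
```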
<p>To <dfn id=concept-stream-push for="I/O queue" export>push</dfn> an <a for=list>item</a>
<var>item</var> to an <a for=/>I/O queue</a> <var>ioQueue</var>, run these steps:
<ol>
<li>
<p>If the last <a for=list>item</a> in <var>ioQueue</var> is <a>end-of-queue</a>, then:
<ol>
<li><p>If <var>item</var> is <a>end-of-queue</a>, do nothing.
<li><p>Otherwise, <a for=list>insert</a> <var>item</var> before the last <a for=list>item</a> in
<var>ioQueue</var>.
</ol>
</li>
<li><p>Otherwise, <a for=list>append</a> <var>item</var> to <var>ioQueue</var>.
</ol>
<p>To <a for="I/O queue">push</a> a sequence of items to an <a for=/>I/O queue</a>
<var>ioQueue</var> is to push each item in the sequence to <var>ioQueue</var>, in the given order.
<p>To <dfn id=concept-stream-prepend for="I/O queue">restore</dfn> an <a for=list>item</a> other
than <a>end-of-queue</a> to an <a for=/>I/O queue</a>, perform the <a for=/>list</a>
<a for=list>prepend</a> operation. To <a for="I/O queue">restore</a> a <a for=/>list</a> of
<a for=list>items</a> excluding <a>end-of-queue</a> to an <a for=/>I/O queue</a>, insert those
items, in the given order, before the first item in the queue.
<p class=example id=example-tokens>Restoring the bytes « 0xF0, 0x9F » to an I/O queue
« 0x92, 0xA9, <a>end-of-queue</a> » results in an I/O queue
« 0xF0, 0x9F, 0x92, 0xA9, <a>end-of-queue</a> ». The next item to be read would be 0xF0. <!-- 💩 -->
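<p class=note>The push and restore operations, and the example above, can be sketched as follows.
The array model and the names <code>END_OF_QUEUE</code>, <code>push</code>, and
<code>restore</code> are assumptions made for illustration.

```javascript
// Hypothetical sentinel standing in for the end-of-queue item.
const END_OF_QUEUE = Symbol("end-of-queue");

// Push keeps end-of-queue as the last item once it is present.
function push(ioQueue, item) {
  const last = ioQueue[ioQueue.length - 1];
  if (last === END_OF_QUEUE) {
    if (item === END_OF_QUEUE) return; // pushing end-of-queue twice does nothing
    ioQueue.splice(ioQueue.length - 1, 0, item); // insert before end-of-queue
  } else {
    ioQueue.push(item);
  }
}

// Restore puts items back at the front, in the given order.
function restore(ioQueue, items) {
  if (items.includes(END_OF_QUEUE)) throw new Error("end-of-queue cannot be restored");
  ioQueue.unshift(...items);
}

const queue = [0x92, 0xA9, END_OF_QUEUE];
restore(queue, [0xF0, 0x9F]); // queue now starts with 0xF0, 0x9F
push(queue, 0x21);            // 0x21 lands before end-of-queue
```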
<p>To <dfn for="from I/O queue">convert</dfn> an <a for=/>I/O queue</a> <var>ioQueue</var> into a
<a for=/>list</a>, <a>string</a>, or <a>byte sequence</a>, return the result of
<a for="I/O queue">reading</a> an indefinite number of <a for=list>items</a> from
<var>ioQueue</var>.
<p>To <dfn for="to I/O queue">convert</dfn> a <a for=/>list</a>, <a>string</a>, or
<a>byte sequence</a> <var>input</var> into an <a for=/>I/O queue</a>, run these steps:
<ol>
<li><p>Assert: if <var>input</var> is a <a for=/>list</a>, then it does not <a for=list>contain</a>
<a>end-of-queue</a>.
<li><p>Return an <a for=/>I/O queue</a> containing the <a for=list>items</a> in <var>input</var>,
in order, followed by <a>end-of-queue</a>.
</ol>
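<p class=note>The two conversions can be sketched for byte input as follows. This assumes the array
model of an immediate-mode queue; <code>END_OF_QUEUE</code>, <code>toIoQueue</code>, and
<code>fromIoQueue</code> are hypothetical names, and conversion to strings is left out.

```javascript
// Hypothetical sentinel standing in for the end-of-queue item.
const END_OF_QUEUE = Symbol("end-of-queue");

// Convert a list or byte sequence into an immediate-mode I/O queue.
function toIoQueue(input) {
  const items = Array.from(input);
  if (items.includes(END_OF_QUEUE)) {
    throw new Error("input must not contain end-of-queue");
  }
  return [...items, END_OF_QUEUE];
}

// Convert an I/O queue back into a list by reading until end-of-queue.
function fromIoQueue(ioQueue) {
  const items = [];
  while (ioQueue.length > 0 && ioQueue[0] !== END_OF_QUEUE) {
    items.push(ioQueue.shift());
  }
  return items;
}

const queue = toIoQueue(Uint8Array.of(0x68, 0x69));
const roundTripped = fromIoQueue(queue);
```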
<p class=XXX>The Infra standard is expected to define some infrastructure around type conversions.
See <a href="https://github.com/whatwg/infra/issues/319">whatwg/infra issue #319</a>. [[INFRA]]
<p class=note><a for=/>I/O queues</a> are defined as <a for=/>lists</a>, not
<a spec=infra>queues</a>, because they feature a <a for="I/O queue">restore</a> operation. However,
this restore operation is an internal detail of the algorithms in this specification, and is not to
be used by other standards. Implementations are free to find alternative ways to implement such
algorithms, as detailed in [[#implementation-considerations]].
<hr>
<p>To obtain a <dfn>scalar value from surrogates</dfn>, given a <a for=/>leading surrogate</a>
<var>leading</var> and a <a for=/>trailing surrogate</a> <var>trailing</var>, return
0x10000 + ((<var>leading</var> − 0xD800) << 10) + (<var>trailing</var> − 0xDC00).
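<p class=note>As a worked instance of the formula above, the following sketch combines the
surrogate pair 0xD83D 0xDCA9. The function name <code>scalarValueFromSurrogates</code> is
hypothetical.

```javascript
// Combine a leading and a trailing surrogate into a scalar value.
function scalarValueFromSurrogates(leading, trailing) {
  return 0x10000 + ((leading - 0xD800) << 10) + (trailing - 0xDC00);
}

// (0xD83D - 0xD800) << 10 is 0xF400, and 0xDCA9 - 0xDC00 is 0xA9,
// so the result is 0x10000 + 0xF400 + 0xA9 = 0x1F4A9.
const scalarValue = scalarValueFromSurrogates(0xD83D, 0xDCA9);
```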
<h2 id=encodings>Encodings</h2>
<p>An <dfn export>encoding</dfn> defines a mapping from a <a>scalar value</a> sequence to
a <a>byte</a> sequence (and vice versa). Each <a for=/>encoding</a> has a
<dfn id=name export for=encoding>name</dfn>, and one or more
<dfn id=label export for=encoding lt=label>labels</dfn>.
<p class="note no-backref">This specification defines three <a for=/>encodings</a> with the same
names as <i>encoding schemes</i> defined in the Unicode standard: <a>UTF-8</a>, <a>UTF-16LE</a>, and
<a>UTF-16BE</a>. The <a for=/>encodings</a> differ from the <i>encoding schemes</i> by byte order
mark (also known as BOM) handling not being part of the <a for=/>encodings</a> themselves and
instead being part of wrapper algorithms in this specification, whereas byte order mark handling is
part of the definition of the <i>encoding schemes</i> in the Unicode Standard. <a>UTF-8</a> used
together with the <a>UTF-8 decode</a> algorithm matches the <i>encoding scheme</i> of the same name.
This specification does not provide wrapper algorithms that would combine with <a>UTF-16LE</a> and
<a>UTF-16BE</a> to match the similarly-named <i>encoding schemes</i>. [[UNICODE]]
<h3 id=encoders-and-decoders>Encoders and decoders</h3>
<p>Each <a for=/>encoding</a> has an associated <dfn>decoder</dfn> and most of them have an
associated <dfn>encoder</dfn>. Instances of <a for=/>decoders</a> and <a for=/>encoders</a> have a
<dfn>handler</dfn> algorithm and might also have state. A <a>handler</a> algorithm takes an input
<a for=/>I/O queue</a> and an <a for=list>item</a>, and returns
<dfn>finished</dfn>, one or more <a for=list>items</a>, <dfn>error</dfn>
optionally with a <a>code point</a>, or <dfn>continue</dfn>.
<p class="note no-backref">The <a>replacement</a> and <a>UTF-16BE/LE</a> <a for=/>encodings</a> have
no <a for=/>encoder</a>.
<p>An <dfn>error mode</dfn> as used below is "<code>replacement</code>" or "<code>fatal</code>" for
a <a for=/>decoder</a> and "<code>fatal</code>" or "<code>html</code>" for an <a for=/>encoder</a>.
<p class=note>An XML processor would set <a for=/>error mode</a> to "<code>fatal</code>".
[[XML]]
<p class=note>"<code>html</code>" exists as <a for=/>error mode</a> due to HTML forms requiring a
non-terminating legacy <a for=/>encoder</a>. The "<code>html</code>" <a for=/>error mode</a> causes
a sequence to be emitted that cannot be distinguished from legitimate input and can therefore lead
to silent data loss. Developers are strongly encouraged to use the <a>UTF-8</a>
<a for=/>encoding</a> to prevent this from happening. [[HTML]]
<hr>
<p>To <dfn lt="process a queue|processing a queue" id=concept-encoding-run>process a queue</dfn>
given an <a for=/>encoding</a>'s <a for=/>decoder</a> or <a for=/>encoder</a> instance
<var>encoderDecoder</var>, <a for=/>I/O queue</a> <var>input</var>, <a for=/>I/O queue</a>
<var>output</var>, and <a for=/>error mode</a> <var>mode</var>:
<ol>
<li>
<p>While true:
<ol>
<li><p>Let <var>result</var> be the result of <a>processing an item</a> with the result of
<a>reading</a> from <var>input</var>, <var>encoderDecoder</var>, <var>input</var>,
<var>output</var>, and <var>mode</var>.
<li><p>If <var>result</var> is not <a>continue</a>, then return <var>result</var>.
</ol>
</ol>
<p>To <dfn lt="process an item|processing an item" id=concept-encoding-process>process an item</dfn>
given an <a for=list>item</a> <var>item</var>, <a for=/>encoding</a>'s <a for=/>encoder</a> or
<a for=/>decoder</a> instance <var>encoderDecoder</var>, <a for=/>I/O queue</a> <var>input</var>,
<a for=/>I/O queue</a> <var>output</var>, and <a for=/>error mode</a> <var>mode</var>:
<ol>
<li><p>Assert: if <var>encoderDecoder</var> is an <a for=/>encoder</a> instance, <var>mode</var> is
not "<code>replacement</code>".
<li><p>Assert: if <var>encoderDecoder</var> is a <a for=/>decoder</a> instance, <var>mode</var> is
not "<code>html</code>".
<li><p>Assert: if <var>encoderDecoder</var> is an <a for=/>encoder</a> instance, <var>item</var> is
not a <a>surrogate</a>.
<li><p>Let <var>result</var> be the result of running <var>encoderDecoder</var>'s <a>handler</a> on
<var>input</var> and <var>item</var>.
<li>
<p>If <var>result</var> is <a>finished</a>:
<ol>
<li><p><a>Push</a> <a>end-of-queue</a> to <var>output</var>.
<li><p>Return <var>result</var>.
</ol>
</li>
<li>
<p>Otherwise, if <var>result</var> is one or more <a for=list>items</a>:
<ol>
<li><p>Assert: if <var>encoderDecoder</var> is a <a for=/>decoder</a> instance, <var>result</var>
does not contain any <a>surrogates</a>.
<li><p><a>Push</a> <var>result</var> to <var>output</var>.
</ol>
<li>
<p>Otherwise, if <var>result</var> is an <a>error</a>, switch on <var>mode</var> and run the
associated steps:
<dl class=switch>
<dt>"<code>replacement</code>"
<dd><a>Push</a> U+FFFD (�) to <var>output</var>.
<dt>"<code>html</code>"
<dd><a>Push</a> 0x26 (&), 0x23 (#), followed by the shortest sequence of 0x30 (0) to
0x39 (9), inclusive, representing <var>result</var>'s <a>code point</a>'s
<a for="code point">value</a> in base ten, followed by 0x3B (;) to <var>output</var>.
<dt>"<code>fatal</code>"
<dd>Return <var>result</var>.
</dl>
<li><p>Return <a>continue</a>.
</ol>
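<p class=note>The two algorithms above can be sketched as follows. The handler used here is a
hypothetical ASCII-only encoder that is not one of the encodings in this specification; it exists
only to trigger the error branch. Queues are modeled as plain arrays in immediate mode, and all
names are assumptions. The assertions in the steps are omitted.

```javascript
const END_OF_QUEUE = Symbol("end-of-queue"); // hypothetical sentinel
const FINISHED = "finished";
const CONTINUE = "continue";

// Hypothetical handler: emits ASCII code points as bytes, errors on anything else.
function asciiEncoderHandler(input, item) {
  if (item === END_OF_QUEUE) return FINISHED;
  if (item <= 0x7F) return [item];
  return { error: true, codePoint: item };
}

function processItem(item, handler, input, output, mode) {
  const result = handler(input, item);
  if (result === FINISHED) {
    output.push(END_OF_QUEUE);
    return result;
  }
  if (result === CONTINUE) return CONTINUE;
  if (Array.isArray(result)) {
    output.push(...result);
  } else if (result.error) {
    if (mode === "fatal") return result;
    if (mode === "replacement") output.push(0xFFFD);
    if (mode === "html") {
      // "&#", the code point in base ten, then ";".
      const digits = Array.from(String(result.codePoint), (d) => d.charCodeAt(0));
      output.push(0x26, 0x23, ...digits, 0x3B);
    }
  }
  return CONTINUE;
}

function processQueue(handler, input, output, mode) {
  while (true) {
    // Reading never removes end-of-queue from the input.
    const item = input[0] === END_OF_QUEUE ? END_OF_QUEUE : input.shift();
    const result = processItem(item, handler, input, output, mode);
    if (result !== CONTINUE) return result;
  }
}

const input = [...Array.from("a💩", (c) => c.codePointAt(0)), END_OF_QUEUE];
const output = [];
const status = processQueue(asciiEncoderHandler, input, output, "html");
const text = String.fromCharCode(...output.slice(0, -1)); // drop end-of-queue
```

<p class=note>In the "<code>html</code>" error mode the unencodable code point becomes the
characters <code>&amp;#128169;</code>, which a consumer cannot tell apart from that same text typed
literally.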
<h3 id=names-and-labels>Names and labels</h3>
<p>The table below lists all <a for=/>encodings</a>
and their <a for=encoding>labels</a> user agents must support.
User agents must not support any other <a for=/>encodings</a>
or <a for=encoding>labels</a>.
<p class=note>For each encoding, <a lt="ASCII lowercase">ASCII-lowercasing</a> its
<a for=encoding>name</a> yields one of its <a for=encoding>labels</a>.
<p>Authors must use the <a>UTF-8</a> <a for=/>encoding</a> and must use its
(<a>ASCII case-insensitive</a>) "<code>utf-8</code>" <a for=encoding>label</a> to identify it.
<p>New protocols and formats, as well as existing formats deployed in new contexts, must use the
<a>UTF-8</a> <a for=/>encoding</a> exclusively. If these protocols and formats need to expose the
<a for=/>encoding</a>'s <a for=encoding>name</a> or <a for=encoding>label</a>, they must expose it
as "<code>utf-8</code>".
<!-- “UTF-8 or death” — Emil A Eklund -->
<p>To
<dfn export lt="get an encoding|getting an encoding" id=concept-encoding-get>get an encoding</dfn>
from a string <var>label</var>, run these steps:
<ol>
<li><p>Remove any leading and trailing <a>ASCII whitespace</a> from
<var>label</var>.
<li><p>If <var>label</var> is an <a>ASCII case-insensitive</a> match for any of the labels listed
in the table below, then return the corresponding <a for=/>encoding</a>; otherwise return failure.
</ol>
<p class=note>This is a more basic and restrictive algorithm of mapping labels to
<a for=/>encodings</a> than
<a href=https://www.unicode.org/reports/tr22/tr22-8.html#Charset_Alias_Matching>section 1.4 of Unicode Technical Standard #22</a>
prescribes, as that is necessary to be compatible with deployed content.
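<p class=note>A sketch of the label lookup, using a small excerpt of the table below rather than
the full list. <code>LABELS</code> and <code>getEncoding</code> are hypothetical names, and
<code>null</code> stands in for failure. Only ASCII whitespace is stripped and only ASCII letters
are lowercased, which is why the built-in <code>trim()</code> and a whole-string
<code>toLowerCase()</code> are avoided.

```javascript
// Excerpt of the labels table; a complete implementation would list every label.
const LABELS = new Map([
  ["unicode-1-1-utf-8", "UTF-8"],
  ["utf-8", "UTF-8"],
  ["utf8", "UTF-8"],
  ["ascii", "windows-1252"],
  ["iso-8859-1", "windows-1252"],
  ["latin1", "windows-1252"],
  ["koi8-r", "KOI8-R"],
  ["utf-16", "UTF-16LE"],
]);

function getEncoding(label) {
  // ASCII whitespace is U+0009, U+000A, U+000C, U+000D, and U+0020.
  const trimmed = label.replace(/^[\t\n\f\r ]+|[\t\n\f\r ]+$/g, "");
  // ASCII case-insensitive match: lowercase A-Z only.
  const lowered = trimmed.replace(/[A-Z]/g, (c) => c.toLowerCase());
  return LABELS.has(lowered) ? LABELS.get(lowered) : null;
}

console.log(getEncoding(" UTF8\n"));       // label with surrounding ASCII whitespace
console.log(getEncoding("\u212Aoi8-r"));   // U+212A KELVIN SIGN is not an ASCII "k"
```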
<table>
<thead>
<tr>
<th>Name
<th>Labels
<tbody>
<tr><th colspan=2><a href=#the-encoding>The Encoding</a>
<tr>
<td rowspan=6><a>UTF-8</a>
<td>"<code>unicode-1-1-utf-8</code>"
<tr><td>"<code>unicode11utf8</code>"
<tr><td>"<code>unicode20utf8</code>"
<tr><td>"<code>utf-8</code>"
<tr><td>"<code>utf8</code>"
<tr><td>"<code>x-unicode20utf8</code>"
<tbody>
<tr><th colspan=2><a href=#legacy-single-byte-encodings>Legacy single-byte encodings</a>
<tr>
<td rowspan=4><a>IBM866</a>
<td>"<code>866</code>"
<tr><td>"<code>cp866</code>"
<tr><td>"<code>csibm866</code>"
<tr><td>"<code>ibm866</code>"
<tr>
<td rowspan=9><a>ISO-8859-2</a>
<td>"<code>csisolatin2</code>"
<tr><td>"<code>iso-8859-2</code>"
<tr><td>"<code>iso-ir-101</code>"
<tr><td>"<code>iso8859-2</code>"
<tr><td>"<code>iso88592</code>"
<tr><td>"<code>iso_8859-2</code>"
<tr><td>"<code>iso_8859-2:1987</code>"
<tr><td>"<code>l2</code>"
<tr><td>"<code>latin2</code>"
<tr>
<td rowspan=9><a>ISO-8859-3</a>
<td>"<code>csisolatin3</code>"
<tr><td>"<code>iso-8859-3</code>"
<tr><td>"<code>iso-ir-109</code>"
<tr><td>"<code>iso8859-3</code>"
<tr><td>"<code>iso88593</code>"
<tr><td>"<code>iso_8859-3</code>"
<tr><td>"<code>iso_8859-3:1988</code>"
<tr><td>"<code>l3</code>"
<tr><td>"<code>latin3</code>"
<tr>
<td rowspan=9><a>ISO-8859-4</a>
<td>"<code>csisolatin4</code>"
<tr><td>"<code>iso-8859-4</code>"
<tr><td>"<code>iso-ir-110</code>"
<tr><td>"<code>iso8859-4</code>"
<tr><td>"<code>iso88594</code>"
<tr><td>"<code>iso_8859-4</code>"
<tr><td>"<code>iso_8859-4:1988</code>"
<tr><td>"<code>l4</code>"
<tr><td>"<code>latin4</code>"
<tr>
<td rowspan=8><a>ISO-8859-5</a>
<td>"<code>csisolatincyrillic</code>"
<tr><td>"<code>cyrillic</code>"
<tr><td>"<code>iso-8859-5</code>"
<tr><td>"<code>iso-ir-144</code>"
<tr><td>"<code>iso8859-5</code>"
<tr><td>"<code>iso88595</code>"
<tr><td>"<code>iso_8859-5</code>"
<tr><td>"<code>iso_8859-5:1988</code>"
<tr>
<td rowspan=14><a>ISO-8859-6</a>
<td>"<code>arabic</code>"
<tr><td>"<code>asmo-708</code>"
<tr><td>"<code>csiso88596e</code>"
<tr><td>"<code>csiso88596i</code>"
<tr><td>"<code>csisolatinarabic</code>"
<tr><td>"<code>ecma-114</code>"
<tr><td>"<code>iso-8859-6</code>"
<tr><td>"<code>iso-8859-6-e</code>"
<tr><td>"<code>iso-8859-6-i</code>"
<tr><td>"<code>iso-ir-127</code>"
<tr><td>"<code>iso8859-6</code>"
<tr><td>"<code>iso88596</code>"
<tr><td>"<code>iso_8859-6</code>"
<tr><td>"<code>iso_8859-6:1987</code>"
<tr>
<td rowspan=12><a>ISO-8859-7</a>
<td>"<code>csisolatingreek</code>"
<tr><td>"<code>ecma-118</code>"
<tr><td>"<code>elot_928</code>"
<tr><td>"<code>greek</code>"
<tr><td>"<code>greek8</code>"
<tr><td>"<code>iso-8859-7</code>"
<tr><td>"<code>iso-ir-126</code>"
<tr><td>"<code>iso8859-7</code>"
<tr><td>"<code>iso88597</code>"
<tr><td>"<code>iso_8859-7</code>"
<tr><td>"<code>iso_8859-7:1987</code>"
<tr><td>"<code>sun_eu_greek</code>"
<tr>
<td rowspan=11><a>ISO-8859-8</a>
<td>"<code>csiso88598e</code>"
<tr><td>"<code>csisolatinhebrew</code>"
<tr><td>"<code>hebrew</code>"
<tr><td>"<code>iso-8859-8</code>"
<tr><td>"<code>iso-8859-8-e</code>"
<tr><td>"<code>iso-ir-138</code>"
<tr><td>"<code>iso8859-8</code>"
<tr><td>"<code>iso88598</code>"
<tr><td>"<code>iso_8859-8</code>"
<tr><td>"<code>iso_8859-8:1988</code>"
<tr><td>"<code>visual</code>"
<tr>
<td rowspan=3><a>ISO-8859-8-I</a>
<td>"<code>csiso88598i</code>"
<tr><td>"<code>iso-8859-8-i</code>"
<tr><td>"<code>logical</code>"
<tr>
<td rowspan=7><a>ISO-8859-10</a>
<td>"<code>csisolatin6</code>"
<tr><td>"<code>iso-8859-10</code>"
<tr><td>"<code>iso-ir-157</code>"
<tr><td>"<code>iso8859-10</code>"
<tr><td>"<code>iso885910</code>"
<tr><td>"<code>l6</code>"
<tr><td>"<code>latin6</code>"
<tr>
<td rowspan=3><a>ISO-8859-13</a>
<td>"<code>iso-8859-13</code>"
<tr><td>"<code>iso8859-13</code>"
<tr><td>"<code>iso885913</code>"
<tr>
<td rowspan=3><a>ISO-8859-14</a>
<td>"<code>iso-8859-14</code>"
<tr><td>"<code>iso8859-14</code>"
<tr><td>"<code>iso885914</code>"
<tr>
<td rowspan=6><a>ISO-8859-15</a>
<td>"<code>csisolatin9</code>"
<tr><td>"<code>iso-8859-15</code>"
<tr><td>"<code>iso8859-15</code>"
<tr><td>"<code>iso885915</code>"
<tr><td>"<code>iso_8859-15</code>"
<tr><td>"<code>l9</code>"
<tr>
<td><a>ISO-8859-16</a>
<td>"<code>iso-8859-16</code>"
<tr>
<td rowspan=5><a>KOI8-R</a>
<td>"<code>cskoi8r</code>"
<tr><td>"<code>koi</code>"
<tr><td>"<code>koi8</code>"
<tr><td>"<code>koi8-r</code>"
<tr><td>"<code>koi8_r</code>"
<tr>
<td rowspan=2><a>KOI8-U</a>
<td>"<code>koi8-ru</code>"
<tr><td>"<code>koi8-u</code>"
<tr>
<td rowspan=4><a>macintosh</a>
<td>"<code>csmacintosh</code>"
<tr><td>"<code>mac</code>"
<tr><td>"<code>macintosh</code>"
<tr><td>"<code>x-mac-roman</code>"
<tr>
<td rowspan=6><a>windows-874</a>
<td>"<code>dos-874</code>"
<tr><td>"<code>iso-8859-11</code>"
<tr><td>"<code>iso8859-11</code>"
<tr><td>"<code>iso885911</code>"
<tr><td>"<code>tis-620</code>"
<tr><td>"<code>windows-874</code>"
<tr>
<td rowspan=3><a>windows-1250</a>
<td>"<code>cp1250</code>"
<tr><td>"<code>windows-1250</code>"
<tr><td>"<code>x-cp1250</code>"
<tr>
<td rowspan=3><a>windows-1251</a>
<td>"<code>cp1251</code>"
<tr><td>"<code>windows-1251</code>"
<tr><td>"<code>x-cp1251</code>"
<tr>
<td rowspan=17><a>windows-1252</a>
<td>"<code>ansi_x3.4-1968</code>"
<tr><td>"<code>ascii</code>"
<tr><td>"<code>cp1252</code>"
<tr><td>"<code>cp819</code>"
<tr><td>"<code>csisolatin1</code>"
<tr><td>"<code>ibm819</code>"
<tr><td>"<code>iso-8859-1</code>"
<tr><td>"<code>iso-ir-100</code>"
<tr><td>"<code>iso8859-1</code>"
<tr><td>"<code>iso88591</code>"
<tr><td>"<code>iso_8859-1</code>"
<tr><td>"<code>iso_8859-1:1987</code>"
<tr><td>"<code>l1</code>"
<tr><td>"<code>latin1</code>"
<tr><td>"<code>us-ascii</code>"
<tr><td>"<code>windows-1252</code>"
<tr><td>"<code>x-cp1252</code>"
<tr>
<td rowspan=3><a>windows-1253</a>
<td>"<code>cp1253</code>"
<tr><td>"<code>windows-1253</code>"
<tr><td>"<code>x-cp1253</code>"
<tr>
<td rowspan=12><a>windows-1254</a>
<td>"<code>cp1254</code>"
<tr><td>"<code>csisolatin5</code>"
<tr><td>"<code>iso-8859-9</code>"
<tr><td>"<code>iso-ir-148</code>"
<tr><td>"<code>iso8859-9</code>"
<tr><td>"<code>iso88599</code>"
<tr><td>"<code>iso_8859-9</code>"
<tr><td>"<code>iso_8859-9:1989</code>"
<tr><td>"<code>l5</code>"
<tr><td>"<code>latin5</code>"
<tr><td>"<code>windows-1254</code>"
<tr><td>"<code>x-cp1254</code>"
<tr>
<td rowspan=3><a>windows-1255</a>
<td>"<code>cp1255</code>"
<tr><td>"<code>windows-1255</code>"
<tr><td>"<code>x-cp1255</code>"
<tr>
<td rowspan=3><a>windows-1256</a>
<td>"<code>cp1256</code>"
<tr><td>"<code>windows-1256</code>"
<tr><td>"<code>x-cp1256</code>"
<tr>
<td rowspan=3><a>windows-1257</a>
<td>"<code>cp1257</code>"
<tr><td>"<code>windows-1257</code>"
<tr><td>"<code>x-cp1257</code>"
<tr>
<td rowspan=3><a>windows-1258</a>
<td>"<code>cp1258</code>"
<tr><td>"<code>windows-1258</code>"
<tr><td>"<code>x-cp1258</code>"
<tr>
<td rowspan=2><a>x-mac-cyrillic</a>
<td>"<code>x-mac-cyrillic</code>"
<tr><td>"<code>x-mac-ukrainian</code>"
<tbody>
<tr><th colspan=2><a href=#legacy-multi-byte-chinese-(simplified)-encodings>Legacy multi-byte Chinese (simplified) encodings</a>
<tr>
<td rowspan=9><a>GBK</a>
<td>"<code>chinese</code>"
<tr><td>"<code>csgb2312</code>"
<tr><td>"<code>csiso58gb231280</code>"
<tr><td>"<code>gb2312</code>"
<tr><td>"<code>gb_2312</code>"
<tr><td>"<code>gb_2312-80</code>"
<tr><td>"<code>gbk</code>"
<tr><td>"<code>iso-ir-58</code>"
<tr><td>"<code>x-gbk</code>"
<tr>
<td><a>gb18030</a>
<td>"<code>gb18030</code>"
<tbody>
<tr><th colspan=2><a href=#legacy-multi-byte-chinese-(traditional)-encodings>Legacy multi-byte Chinese (traditional) encodings</a>
<tr>
<td rowspan=5><a>Big5</a>
<td>"<code>big5</code>"
<tr><td>"<code>big5-hkscs</code>"
<tr><td>"<code>cn-big5</code>"
<tr><td>"<code>csbig5</code>"
<tr><td>"<code>x-x-big5</code>"
<tbody>
<tr><th colspan=2><a href=#legacy-multi-byte-japanese-encodings>Legacy multi-byte Japanese encodings</a>
<tr>
<td rowspan=3><a>EUC-JP</a>
<td>"<code>cseucpkdfmtjapanese</code>"
<tr><td>"<code>euc-jp</code>"
<tr><td>"<code>x-euc-jp</code>"
<tr>
<td rowspan=2><a>ISO-2022-JP</a>
<td>"<code>csiso2022jp</code>"
<tr><td>"<code>iso-2022-jp</code>"
<tr>
<td rowspan=8><a>Shift_JIS</a>
<td>"<code>csshiftjis</code>"
<tr><td>"<code>ms932</code>"
<tr><td>"<code>ms_kanji</code>"
<tr><td>"<code>shift-jis</code>"
<tr><td>"<code>shift_jis</code>"
<tr><td>"<code>sjis</code>"
<tr><td>"<code>windows-31j</code>"
<tr><td>"<code>x-sjis</code>"
<tbody>
<tr><th colspan=2><a href=#legacy-multi-byte-korean-encodings>Legacy multi-byte Korean encodings</a>
<tr>
<td rowspan=10><a>EUC-KR</a>
<td>"<code>cseuckr</code>"
<tr><td>"<code>csksc56011987</code>"
<tr><td>"<code>euc-kr</code>"
<tr><td>"<code>iso-ir-149</code>"
<tr><td>"<code>korean</code>"
<tr><td>"<code>ks_c_5601-1987</code>"
<tr><td>"<code>ks_c_5601-1989</code>"
<tr><td>"<code>ksc5601</code>"
<tr><td>"<code>ksc_5601</code>"
<tr><td>"<code>windows-949</code>"
<tbody>
<tr><th colspan=2><a href=#legacy-miscellaneous-encodings>Legacy miscellaneous encodings</a>
<tr>
<td rowspan=6><a>replacement</a>
<td>"<code>csiso2022kr</code>"
<tr><td>"<code>hz-gb-2312</code>"
<tr><td>"<code>iso-2022-cn</code>"
<tr><td>"<code>iso-2022-cn-ext</code>"
<tr><td>"<code>iso-2022-kr</code>"
<tr><td>"<code>replacement</code>"
<tr>
<td rowspan=2><a>UTF-16BE</a>
<td>"<code>unicodefffe</code>"
<tr><td>"<code>utf-16be</code>"
<tr>
<td rowspan=7><a>UTF-16LE</a>
<td>"<code>csunicode</code>"
<tr><td>"<code>iso-10646-ucs-2</code>"
<tr><td>"<code>ucs-2</code>"
<tr><td>"<code>unicode</code>"
<tr><td>"<code>unicodefeff</code>"
<tr><td>"<code>utf-16</code>"
<tr><td>"<code>utf-16le</code>"
<tr>
<td><a>x-user-defined</a>
<td>"<code>x-user-defined</code>"
</table>
<p class=note>All <a for=/>encodings</a> and their <a for=encoding>labels</a> are also available as
a non-normative <a href=encodings.json>encodings.json</a> resource.
<p class=note id=supported-encodings>The set of supported <a for=/>encodings</a> is primarily based
on the intersection of the sets supported by major browser engines when the development of this
standard started, while removing encodings that were rarely used legitimately but that could be used
in attacks. The inclusion of some encodings is questionable in the light of anecdotal evidence of
the level of use by existing Web content. That is, while they have been broadly supported by
browsers, it is unclear if they are broadly used by Web content. However, an effort has not been
made to eagerly remove <a>single-byte encodings</a> that were broadly supported by browsers or are
part of the ISO 8859 series. In particular, the necessity of the inclusion of <a>IBM866</a>,
<a>macintosh</a>, <a>x-mac-cyrillic</a>, <a>ISO-8859-3</a>, <a>ISO-8859-10</a>, <a>ISO-8859-14</a>,
and <a>ISO-8859-16</a> is doubtful for the purpose of supporting existing content, but there are no
plans to remove these.</p>
<h3 id=output-encodings>Output encodings</h3>
<p>To <dfn export>get an output encoding</dfn> from an <a for=/>encoding</a>
<var>encoding</var>, run these steps:
<ol>
<li><p>If <var>encoding</var> is <a>replacement</a> or <a>UTF-16BE/LE</a>, then return
<a>UTF-8</a>.
<li><p>Return <var>encoding</var>.
</ol>
<p class=note>The <a>get an output encoding</a> algorithm is useful for URL parsing and HTML
form submission, which both need exactly this.
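<div class=example id=example-get-an-output-encoding-sketch>
<p>A minimal, non-normative sketch of <a>get an output encoding</a>. It assumes that an
<a for=/>encoding</a> is represented by its name as a plain string; the function name
<code>get_output_encoding</code> is hypothetical and not part of this standard.

```python
# Assumption of this sketch: encodings are identified by their names as strings.
UTF_8 = "UTF-8"
ENCODINGS_WITHOUT_ENCODER_OUTPUT = {"replacement", "UTF-16BE", "UTF-16LE"}


def get_output_encoding(encoding: str) -> str:
    """Return UTF-8 for replacement and UTF-16BE/LE, else the encoding itself."""
    if encoding in ENCODINGS_WITHOUT_ENCODER_OUTPUT:
        return UTF_8
    return encoding
```
</div>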
<h2 id=indexes>Indexes</h2>
<p>Most legacy <a for=/>encodings</a> make use of an <dfn id=index>index</dfn>. An
<a>index</a> is an ordered list of entries, each entry consisting of a pointer and a
corresponding code point. Within an <a>index</a> pointers are unique and code points can be
duplicated.
<p class="note no-backref">An efficient implementation likely has two
<a lt=index>indexes</a> per <a for=/>encoding</a>: one optimized for its
<a for=/>decoder</a> and one for its <a for=/>encoder</a>.
<p>To find the pointers and their corresponding code points in an <a>index</a>,
let <var>lines</var> be the result of splitting the resource's contents on U+000A.
Then remove each item in <var>lines</var> that is the empty string or starts with U+0023.
Then the pointers and their corresponding code points are found by splitting each item in <var>lines</var> on U+0009.
The first subitem is the pointer (as a decimal number) and the second is the corresponding code point (as a hexadecimal number).
Other subitems are not relevant.
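<div class=example id=example-index-resource-parsing-sketch>
<p>The parsing steps above can be sketched as follows. This is a non-normative illustration; the
function name <code>parse_index</code> and the in-memory representation (a list of pointer and code
point pairs) are assumptions of the sketch.

```python
def parse_index(contents: str) -> list[tuple[int, int]]:
    """Parse the text of an index resource into (pointer, code point) pairs."""
    entries = []
    # Split on U+000A, then drop empty lines and lines starting with U+0023 (#).
    for line in contents.split("\n"):
        if line == "" or line.startswith("#"):
            continue
        # Split on U+0009; only the first two subitems are relevant.
        subitems = line.split("\t")
        pointer = int(subitems[0])         # decimal; int() ignores padding spaces
        code_point = int(subitems[1], 16)  # hexadecimal; a "0x" prefix is accepted
        entries.append((pointer, code_point))
    return entries
```

<p>For instance, given the made-up contents <code>"# comment\n\n  0\t0x0080\tignored\n"</code>,
the sketch yields a single entry with pointer 0 and code point U+0080.
</div>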
<p class="note no-backref">To signify changes, an <a>index</a> includes an
<i>Identifier</i> and a <i>Date</i>. If an <i>Identifier</i> has
changed, so has the <a>index</a>.
<p>The <dfn>index code point</dfn> for <var>pointer</var> in
<var>index</var> is the code point corresponding to
<var>pointer</var> in <var>index</var>, or null if
<var>pointer</var> is not in <var>index</var>.
<p>The <dfn>index pointer</dfn> for <var>code point</var> in
<var>index</var> is the <em>first</em> pointer corresponding to
<var>code point</var> in <var>index</var>, or null if
<var>code point</var> is not in <var>index</var>.
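<div class=example id=example-index-lookup-sketch>
<p>A non-normative sketch of the two lookups, assuming an <a>index</a> is held as a list of
(pointer, code point) pairs in index order. The names <code>index_code_point</code> and
<code>index_pointer</code> are hypothetical, and the linear scans favor clarity over speed.

```python
from typing import Optional

Index = list[tuple[int, int]]  # (pointer, code point) pairs in index order


def index_code_point(pointer: int, index: Index) -> Optional[int]:
    """Return the code point for pointer, or None if pointer is not in index."""
    for entry_pointer, entry_code_point in index:
        if entry_pointer == pointer:
            return entry_code_point
    return None


def index_pointer(code_point: int, index: Index) -> Optional[int]:
    """Return the first pointer for code_point, or None if it is not in index."""
    for entry_pointer, entry_code_point in index:
        if entry_code_point == code_point:
            return entry_pointer
    return None
```

<p>Because code points can be duplicated, the second function deliberately stops at the first
match.
</div>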
<div class=note id=visualization>
<p>There is a non-normative visualization for each <a>index</a> other than
<a>index gb18030 ranges</a> and <a>index ISO-2022-JP katakana</a>. <a>index jis0208</a> also has an
alternative <a>Shift_JIS</a> visualization. Additionally, there is a visualization of the Basic
Multilingual Plane coverage of each index other than <a>index gb18030 ranges</a> and
<a>index ISO-2022-JP katakana</a>.
<p>The legend for the visualizations is:
<ul class=visualizationlegend>
<li class=unmapped>Unmapped
<li class=mid>Two bytes in UTF-8
 <li class="mid contiguous">Two bytes in UTF-8, code point immediately follows the code point of
 the previous pointer
 <li class=upper>Three bytes in UTF-8 (non-PUA)
 <li class="upper contiguous">Three bytes in UTF-8 (non-PUA), code point immediately follows the
 code point of the previous pointer
 <li class=pua>Private Use
 <li class="pua contiguous">Private Use, code point immediately follows the code point of the
 previous pointer
 <li class=astral>Four bytes in UTF-8
 <li class="astral contiguous">Four bytes in UTF-8, code point immediately follows the code point
 of the previous pointer
<li class=duplicate>Duplicate code point already mapped at an earlier index
<li class=compatibility>CJK Compatibility Ideograph
<li class=ext>CJK Unified Ideographs Extension A
</ul>
</div>
<p>These are the <a lt=index>indexes</a> defined by this
specification, excluding <a>index single-byte</a>, which have their own table:
<table>
<tbody><tr><th colspan=4><a>Index</a><th>Notes
<tr>
<td><dfn export>index Big5</dfn>
<td><a href=index-big5.txt>index-big5.txt</a>
<td><a href=big5.html>index Big5 visualization</a>
<td><a href=big5-bmp.html>index Big5 BMP coverage</a>
<td>This matches the Big5 standard in combination with the
Hong Kong Supplementary Character Set and other common extensions.
<tr>
<td><dfn export>index EUC-KR</dfn>
<td><a href=index-euc-kr.txt>index-euc-kr.txt</a>
<td><a href=euc-kr.html>index EUC-KR visualization</a>
<td><a href=euc-kr-bmp.html>index EUC-KR BMP coverage</a>
<td>This matches the KS X 1001 standard and the Unified Hangul Code, more commonly known together
as Windows Codepage 949. It covers the Hangul Syllables block of Unicode in its entirety. The
Hangul block whose top left corner in the visualization is at pointer 9026 is in the Unicode
order. Taken separately, the rest of the Hangul syllables in this index are in the Unicode order,
too.
<tr>
<td><dfn export>index gb18030</dfn>
<td><a href=index-gb18030.txt>index-gb18030.txt</a>
<td><a href=gb18030.html>index gb18030 visualization</a>
<td><a href=gb18030-bmp.html>index gb18030 BMP coverage</a>
<td>This matches the GB18030-2022 standard for code points encoded as two bytes, except for
0xA3 0xA0 which maps to U+3000 to be compatible with deployed content. This index covers the
CJK Unified Ideographs block of Unicode in its entirety. Entries from that block that are above or
to the left of (the first) U+3000 in the visualization are in the Unicode order.
<!-- https://bugzilla.mozilla.org/show_bug.cgi?id=131837
https://bugs.webkit.org/show_bug.cgi?id=17014
https://www.w3.org/Bugs/Public/show_bug.cgi?id=25396
https://github.com/whatwg/encoding/issues/17 -->
<tr>
<td><dfn export>index gb18030 ranges</dfn>
<td colspan=3><a href=index-gb18030-ranges.txt>index-gb18030-ranges.txt</a>
<td>This <a>index</a> works differently from all others. Listing all code points would result
in over a million items whereas they can be represented neatly in 207 ranges combined with trivial
limit checks. It therefore only superficially matches the GB18030-2000 standard for code points
encoded as four bytes. The change for the GB18030-2005 revision is handled inline by the
<a>index gb18030 ranges code point</a> and <a>index gb18030 ranges pointer</a> algorithms below
that accompany this index. And the changes for the GB18030-2022 revision are handled differently
again to not further increase the number of byte sequences mapping to Private Use code points. The
relevant Private Use code points are mapped in the <a>gb18030 encoder</a> directly through a side
table to preserve compatibility with how they were mapped before.
<tr>
<td><dfn export>index jis0208</dfn>
<td><a href=index-jis0208.txt>index-jis0208.txt</a>
<td><a href=jis0208.html>index jis0208 visualization</a>, <a href=shift_jis.html>Shift_JIS visualization</a>
<td><a href=jis0208-bmp.html>index jis0208 BMP coverage</a>
<td>This is the JIS X 0208 standard including formerly proprietary
extensions from IBM and NEC.
<!-- NEC = Nippon Electric Company -->
<tr>
<td><dfn export>index jis0212</dfn>
<td><a href=index-jis0212.txt>index-jis0212.txt</a>
<td><a href=jis0212.html>index jis0212 visualization</a>
<td><a href=jis0212-bmp.html>index jis0212 BMP coverage</a>
<td>This is the JIS X 0212 standard. It is only used by the <a>EUC-JP decoder</a>
due to lack of widespread support elsewhere.
<!--
No JIS X 0212 EUC-JP encoder support:
https://bugzilla.mozilla.org/show_bug.cgi?id=600715
https://code.google.com/p/chromium/issues/detail?id=78847
No JIS X 0212 ISO-2022-JP support:
https://www.w3.org/Bugs/Public/show_bug.cgi?id=26885
-->
<tr>
<td><dfn export>index ISO-2022-JP katakana</dfn>
<td colspan=3><a href=index-iso-2022-jp-katakana.txt>index-iso-2022-jp-katakana.txt</a>
<td>This maps halfwidth to fullwidth katakana as per Unicode Normalization Form KC, except that
U+FF9E and U+FF9F map to U+309B and U+309C rather than U+3099 and U+309A. It is only used by the
<a>ISO-2022-JP encoder</a>. [[UNICODE]]
</table>
<p>The <dfn>index gb18030 ranges code point</dfn> for <var>pointer</var> is
the return value of these steps:
<ol>
<li><p>If <var>pointer</var> is greater than 39419 and less than
189000, or <var>pointer</var> is greater than 1237575, return null.
<li><p>If <var>pointer</var> is 7457, return code point U+E7C7.
<!-- 7457 is 0x81 0x35 0xF4 0x37 -->
<li><p>Let <var>offset</var> be the last pointer in <a>index gb18030 ranges</a> that is less than
or equal to <var>pointer</var> and let <var>code point offset</var> be its corresponding code
point.
<li><p>Return a code point whose value is
<var>code point offset</var> + <var>pointer</var> − <var>offset</var>.
</ol>
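<div class=example id=example-index-gb18030-ranges-code-point-sketch>
<p>The steps above can be sketched as follows. This is non-normative. The function name is
hypothetical, and <code>RANGES_EXCERPT</code> contains only a handful of entries from
<a>index gb18030 ranges</a>, so results are meaningful only for pointers that fall in the listed
ranges; a real implementation would load the full resource.

```python
from typing import Optional

# Excerpt only: (pointer, code point) pairs in index order. The entries between
# pointer 50 and pointer 39394 are omitted here.
RANGES_EXCERPT = [
    (0, 0x0080),
    (36, 0x00A5),
    (38, 0x00A9),
    (45, 0x00B2),
    (50, 0x00B8),
    (39394, 0xFFE6),
    (189000, 0x10000),
]


def index_gb18030_ranges_code_point(
    pointer: int, ranges: list[tuple[int, int]]
) -> Optional[int]:
    # Limit checks: the gap between the two pointer blocks, and the upper bound.
    if 39419 < pointer < 189000 or pointer > 1237575:
        return None
    # Inline special case for the GB18030-2005 revision.
    if pointer == 7457:
        return 0xE7C7
    # Find the last range whose starting pointer is <= pointer.
    offset = code_point_offset = None
    for range_pointer, range_code_point in ranges:
        if range_pointer <= pointer:
            offset, code_point_offset = range_pointer, range_code_point
    if offset is None:  # Defensive: only reachable for negative pointers.
        return None
    return code_point_offset + pointer - offset
```
</div>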
<p>The <dfn>index gb18030 ranges pointer</dfn> for <var>code point</var> is
the return value of these steps:
<ol>
<li><p>If <var>code point</var> is U+E7C7, return pointer 7457.
<li><p>Let <var>offset</var> be the last code point in <a>index gb18030 ranges</a> that is less
than or equal to <var>code point</var> and let <var>pointer offset</var> be its corresponding
pointer.
<li><p>Return a pointer whose value is
<var>pointer offset</var> + <var>code point</var> − <var>offset</var>.
</ol>
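<div class=example id=example-index-gb18030-ranges-pointer-sketch>
<p>A non-normative sketch of the reverse lookup. As before, the function name is hypothetical and
<code>RANGES_EXCERPT</code> is a partial copy of <a>index gb18030 ranges</a>, so only code points
inside the listed ranges give meaningful results. The sketch also assumes the caller has already
established that the code point is not covered by another <a>index</a>.

```python
from typing import Optional

# Excerpt only: (pointer, code point) pairs in index order. The entries between
# pointer 50 and pointer 39394 are omitted here.
RANGES_EXCERPT = [
    (0, 0x0080),
    (36, 0x00A5),
    (38, 0x00A9),
    (45, 0x00B2),
    (50, 0x00B8),
    (39394, 0xFFE6),
    (189000, 0x10000),
]


def index_gb18030_ranges_pointer(
    code_point: int, ranges: list[tuple[int, int]]
) -> Optional[int]:
    # Inline special case for the GB18030-2005 revision.
    if code_point == 0xE7C7:
        return 7457
    # Find the last range whose starting code point is <= code_point.
    offset = pointer_offset = None
    for range_pointer, range_code_point in ranges:
        if range_code_point <= code_point:
            pointer_offset, offset = range_pointer, range_code_point
    if offset is None:  # Defensive: code point precedes the first range.
        return None
    return pointer_offset + code_point - offset
```
</div>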
<p>The <dfn>index Shift_JIS pointer</dfn> for <var>code point</var> is the return value of these
steps:
<ol>
<li>
<p>Let <var>index</var> be <a>index jis0208</a> excluding all entries whose pointer is in
the range 8272 to 8835, inclusive.
<!-- selected NEC duplicates from IBM extensions later in the index; need to use IBM
extensions when going back to bytes -->
<p class=note>The <a>index jis0208</a> contains duplicate code points so the exclusion of
these entries causes later code points to be used.
<li><p>Return the <a>index pointer</a> for <var>code point</var> in
<var>index</var>.
</ol>
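<div class=example id=example-index-shift-jis-pointer-sketch>
<p>A non-normative sketch of the steps above. The function name is hypothetical, and the index is
assumed to be a list of (pointer, code point) pairs in index order. Any entries passed in by a
reader trying this out are their own; nothing here is copied from
<a href=index-jis0208.txt>index-jis0208.txt</a>.

```python
from typing import Optional


def index_shift_jis_pointer(
    code_point: int, jis0208: list[tuple[int, int]]
) -> Optional[int]:
    # Exclude entries whose pointer is in the range 8272 to 8835, inclusive,
    # so that a later duplicate of the same code point is found instead.
    index = [(p, cp) for p, cp in jis0208 if not 8272 <= p <= 8835]
    # Return the first remaining pointer for code_point, or None if absent.
    for pointer, entry_code_point in index:
        if entry_code_point == code_point:
            return pointer
    return None
```

<p>With an invented index in which one code point appears at pointers 8300 and 10750, the sketch
returns 10750, since the entry at 8300 is excluded.
</div>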
<p>The <dfn>index Big5 pointer</dfn> for <var>code point</var> is the return value of
these steps:
<ol>
<li>
<p>Let <var>index</var> be <a>index Big5</a> excluding all entries whose pointer is less
than (0xA1 - 0x81) × 157.
<p class=note>Avoid returning Hong Kong Supplementary Character Set extensions literally.
<li>
<p>If <var>code point</var> is U+2550, U+255E, U+2561, U+256A, U+5341, or U+5345,
return the <em>last</em> pointer corresponding to <var>code point</var> in
<var>index</var>.
<!-- https://www.w3.org/Bugs/Public/show_bug.cgi?id=27878 -->
<p class=note>There are other duplicate code points, but for those the <em>first</em> pointer is
to be used.
<li><p>Return the <a>index pointer</a> for <var>code point</var> in
<var>index</var>.
</ol>
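<div class=example id=example-index-big5-pointer-sketch>
<p>A non-normative sketch of the steps above. The function and constant names are hypothetical,
and the index is assumed to be a list of (pointer, code point) pairs in index order. Note that
(0xA1 - 0x81) × 157 is 5024.

```python
from typing import Optional

LOWEST_INCLUDED_POINTER = (0xA1 - 0x81) * 157  # 5024
# Code points for which the last, rather than the first, pointer is returned.
LAST_POINTER_CODE_POINTS = {0x2550, 0x255E, 0x2561, 0x256A, 0x5341, 0x5345}


def index_big5_pointer(
    code_point: int, big5: list[tuple[int, int]]
) -> Optional[int]:
    # Exclude all entries whose pointer is less than (0xA1 - 0x81) * 157.
    index = [(p, cp) for p, cp in big5 if p >= LOWEST_INCLUDED_POINTER]
    matches = [p for p, cp in index if cp == code_point]
    if not matches:
        return None
    if code_point in LAST_POINTER_CODE_POINTS:
        return matches[-1]
    # For every other duplicate, the first pointer is used.
    return matches[0]
```
</div>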
<hr>
<p class="note no-backref">All <a lt=index>indexes</a> are also available as a non-normative
<a href=indexes.json>indexes.json</a> resource. (<a>Index gb18030 ranges</a> has a slightly
different format here, to be able to represent ranges.)
<h2 id=specification-hooks>Hooks for standards</h2>
<div class=note>
<p>The algorithms defined below (<a>UTF-8 decode</a>, <a>UTF-8 decode without BOM</a>,
<a>UTF-8 decode without BOM or fail</a>, and <a>UTF-8 encode</a>) are intended for usage by other
standards.
<p>For decoding, <a>UTF-8 decode</a> is to be used by new formats. For identifiers or byte
sequences within a format or protocol, use <a>UTF-8 decode without BOM</a> or
<a>UTF-8 decode without BOM or fail</a>.
<p>For encoding, <a>UTF-8 encode</a> is to be used.
<p>Standards are to ensure that the input I/O queues they pass to <a>UTF-8 encode</a> (as well as
the legacy <a>encode</a>) are effectively I/O queues of scalar values, i.e., they contain no
<a>surrogates</a>.
<p>These hooks (as well as <a>decode</a> and <a>encode</a>) will block until the input I/O queue
has been consumed in its entirety. In order to use the output tokens as they are pushed into the
stream, callers are to invoke the hooks with an empty output I/O queue and read from it
<a>in parallel</a>. Note that some care is needed when using
<a>UTF-8 decode without BOM or fail</a>, as any error found during decoding will prevent the
<a>end-of-queue</a> item from ever being pushed into the output I/O queue.
</div>
<p>To <dfn export>UTF-8 decode</dfn> an I/O queue of bytes <var>ioQueue</var> given an optional I/O
queue of scalar values <var>output</var> (default « »), run these steps:
<ol>