<!DOCTYPE html><html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>You can’t suggest that?! Comparisons and improvements of speller error models</title>
<!--Generated on Wed Aug 31 04:49:11 2022 by LaTeXML (version 0.8.6) http://dlmf.nist.gov/LaTeXML/.-->
<link rel="stylesheet" href="../latexml/LaTeXML.css" type="text/css">
<link rel="stylesheet" href="../latexml/ltx-article.css" type="text/css">
<link rel="stylesheet" href="ltx-listings.css" type="text/css">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
</head>
<body>
<div class="ltx_page_main">
<div class="ltx_page_content">
<article class="ltx_document ltx_authors_1line">
<h1 class="ltx_title ltx_title_document">You can’t suggest that?!
<br class="ltx_break">Comparisons and improvements of speller error
models</h1>
<div class="ltx_authors">
<span class="ltx_creator ltx_role_author">
<span class="ltx_personname">Heiki-Jaan Kaalep, Flammie Pirinen, Sjur Nørstebø Moshagen
<br class="ltx_break">Tartu ülikool (Kaalep), UiT Norgga árktalaš universitehta (Pirinen, Moshagen)
</span></span>
</div>
<div class="ltx_abstract">
<h6 class="ltx_title ltx_title_abstract">Abstract</h6>
<p class="ltx_p">In this article, we study the correction of spelling errors, specifically how
spelling errors are made and how we can model them computationally in order
to fix them. The article describes two different approaches to generating
spelling correction suggestions for three Uralic languages: Estonian, North
Sámi and South Sámi. The first approach of modelling spelling errors is
rule-based, where experts write rules that describe the kind of errors that
are made, and these are compiled into a finite-state automaton that models
the errors. The second is data driven, where we show a machine learning
algorithm a list of errors that humans have made, and it creates a neural
network that can model the errors. Both approaches require collections of
misspelling lists and an understanding of their contents; therefore, we also
describe the actual errors we have seen in detail. We find that while both
approaches create error correction systems, with current resources the
expert-built systems are still more reliable.</p>
</div>
<div id="p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">Keywords: spell-checking, rule-based, FSA, machine learning, Sámi languages, Estonian</span></p>
</div>
<section id="S1" class="ltx_section">
<h2 class="ltx_title ltx_title_section" style="font-size:90%;">
<span class="ltx_tag ltx_tag_section">1 </span>Introduction</h2>
<div id="S1.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">The ultimate speller only accepts correct words, finds all spelling errors, and
always gives the one and only relevant suggestion. This speller will never
exist, but it is the ultimate speller we strive to achieve. In this article we
explore a few ideas in that direction, and apply them to three languages found
in the </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">GiellaLT</span><span class="ltx_text" style="font-size:90%;">
infrastructure</span><span id="footnote1" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">1</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">1</sup>
<span class="ltx_tag ltx_tag_note"><span class="ltx_text" style="font-size:90%;">1</span></span>
<a href="https://giellalt.github.io/" title="" class="ltx_ref ltx_url ltx_font_typewriter" style="font-size:90%;">https://giellalt.github.io/</a></span></span></span><span class="ltx_text" style="font-size:90%;">: North Sámi, South
Sámi and Estonian. More precisely, this article looks at the error model, and
how to improve the suggestions given.</span></p>
</div>
<div id="S1.p2" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">To that end, our goal is to reduce the noise level (increase precision) by
generating as few irrelevant suggestions as possible, and when in doubt, give no
suggestion at all rather than risk giving irrelevant suggestions; this is in
contrast with e.g. Hunspell</span><span id="footnote2" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">2</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">2</sup>
<span class="ltx_tag ltx_tag_note"><span class="ltx_text" style="font-size:90%;">2</span></span>
<a href="https://hunspell.github.io/" title="" class="ltx_ref ltx_url ltx_font_typewriter" style="font-size:90%;">https://hunspell.github.io/</a></span></span></span><span class="ltx_text" style="font-size:90%;">
(</span><cite class="ltx_cite ltx_citemacro_cite"><span class="ltx_text" style="font-size:90%;">[</span><a href="#bib.bib283" title="Hunmorph: open source word analysis" class="ltx_ref">29</a><span class="ltx_text" style="font-size:90%;">]</span></cite><span class="ltx_text" style="font-size:90%;">) and the rest of the Xspell family (Ispell,
Aspell</span><span id="footnote3" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">3</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">3</sup>
<span class="ltx_tag ltx_tag_note"><span class="ltx_text" style="font-size:90%;">3</span></span>
<a href="http://aspell.net" title="" class="ltx_ref ltx_url ltx_font_typewriter" style="font-size:90%;">http://aspell.net</a></span></span></span><span class="ltx_text" style="font-size:90%;">, Myspell,
nuspell</span><span id="footnote4" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">4</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">4</sup>
<span class="ltx_tag ltx_tag_note"><span class="ltx_text" style="font-size:90%;">4</span></span>
<a href="https://nuspell.github.io" title="" class="ltx_ref ltx_url ltx_font_typewriter" style="font-size:90%;">https://nuspell.github.io</a></span></span></span><span class="ltx_text" style="font-size:90%;">, etc.). While pursuing this
goal, we try to understand the reasons behind mistyping, and assume that
classifying the errors will give us some insight. Having this insight, it might
be possible to find ways of increasing recall as well.</span></p>
</div>
<div id="S1.p3" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">An attempt to find regularities in misspellings naturally invokes the idea that
</span><span class="ltx_text" style="font-size:90%;">one might try machine learning for this purpose; one should use all tools
available for achieving one’s goal.</span></p>
</div>
<div id="S1.p4" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">The approaches that will be investigated are the following:</span></p>
</div>
<div id="S1.p5" class="ltx_para">
<ul id="S1.I1" class="ltx_itemize">
<li id="S1.I1.i1" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_item"><span class="ltx_text" style="font-size:90%;">•</span></span>
<div id="S1.I1.i1.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">hand-crafted regex error model</span></p>
</div>
</li>
<li id="S1.I1.i2" class="ltx_item" style="list-style-type:none;padding-top:-2.0pt;">
<span class="ltx_tag ltx_tag_item"><span class="ltx_text" style="font-size:90%;">•</span></span>
<div id="S1.I1.i2.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">machine-learned error model</span></p>
</div>
</li>
</ul>
</div>
<div id="S1.p6" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">The work described in this article says nothing about coverage, i.e. how many
words flagged by the speller are real errors and how many are actually correct
words, missing from the speller’s vocabulary; or how many misspelled words are
falsely recognized as correct. We limit ourselves to real misspellings.</span></p>
</div>
<div id="S1.p7" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">The article is organized as follows: first, there is a short overview of earlier
work. Following that, we describe the methods used for developing new error
models. We then describe the misspelling lists used for development, testing and
evaluation. After that we say a few words about the types of errors in these
lists, followed by a short description of the main features of the languages and
their orthography, focusing on the parts relevant to this paper. We then
describe the new error models in detail, starting with a short overview of our
baseline error model, after which we evaluate the performance of the new error
models. Finally, there is a discussion on the outcome, and a conclusion.
</span></p>
</div>
</section>
<section id="S2" class="ltx_section">
<h2 class="ltx_title ltx_title_section" style="font-size:90%;">
<span class="ltx_tag ltx_tag_section">2 </span>Earlier work</h2>
<div id="S2.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">A lot of work has been done on spelling corrections—we give an overview of the
literature here—although most of it looks at English and closely or
typologically related languages. See
e.g. </span><cite class="ltx_cite ltx_citemacro_cite"><span class="ltx_text" style="font-size:90%;">[</span><a href="#bib.bib140" title="Techniques for automatically correcting words in text" class="ltx_ref">17</a>, <a href="#bib.bib106" title="Survey of automatic spelling correction" class="ltx_ref">13</a><span class="ltx_text" style="font-size:90%;">]</span></cite><span class="ltx_text" style="font-size:90%;">. Working with languages with
a complex morphology and phonology does offer some additional challenges, and
minority and indigenous languages with a recent writing culture add to those
challenges; moreover, not much work has been done in this area.</span></p>
</div>
<div id="S2.p2" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">Finite-state language models have been used in spell-checking and correction for
a while; one of the most recent approaches, which is also the basis of our system,
is </span><cite class="ltx_cite ltx_citemacro_cite"><span class="ltx_text" style="font-size:90%;">[</span><a href="#bib.bib231" title="State-of-the-art in weighted finite-state spell-checking" class="ltx_ref">26</a><span class="ltx_text" style="font-size:90%;">]</span></cite><span class="ltx_text" style="font-size:90%;">. Within the Sámi language context, such work has
been done from </span><cite class="ltx_cite ltx_citemacro_cite"><span class="ltx_text" style="font-size:90%;">[</span><a href="#bib.bib90" title="From xerox to aspell: a first prototype of a north sámi speller based on twol technology" class="ltx_ref">12</a><span class="ltx_text" style="font-size:90%;">]</span></cite><span class="ltx_text" style="font-size:90%;"> onwards.</span></p>
</div>
<div id="S2.p3" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">Substantial work on analysing North Sámi spelling errors was done in
</span><cite class="ltx_cite ltx_citemacro_cite"><span class="ltx_text" style="font-size:90%;">[</span><a href="#bib.bib13" title="Cállinmeattáhusaid guorran." class="ltx_ref">1</a><span class="ltx_text" style="font-size:90%;">]</span></cite><span class="ltx_text" style="font-size:90%;">, and the insights gained were important
for the work done with the North Sámi speller in this article. To the best of
our knowledge, no other Sámi languages have been analysed with regard to
spelling errors, their classification and frequency.</span></p>
</div>
<div id="S2.p4" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">Estonian spelling errors that emerge while typing on a computer keyboard have
not been described in publications. However, the Estonian spellers that were
created by Filosoft Ltd. at the beginning of the 1990s (e.g. for Microsoft
</span><span class="ltx_text" style="font-size:90%;">Word) contain a suggestion module, and since their </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">C</span><span class="ltx_text" style="font-size:90%;">-language source
code has been made public</span><span id="footnote5" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">5</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">5</sup>
<span class="ltx_tag ltx_tag_note"><span class="ltx_text" style="font-size:90%;">5</span></span>
<a href="https://github.com/Filosoft/vabamorf" title="" class="ltx_ref ltx_url ltx_font_typewriter" style="font-size:90%;">https://github.com/Filosoft/vabamorf</a></span></span></span><span class="ltx_text" style="font-size:90%;">,
it has been possible to re-implement it as an FST.</span></p>
</div>
<div id="S2.p5" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">There is some prior work on the general problem of error correction using
neural networks, and this is often suggested to be the current state of the art,
so we have chosen to experiment with this approach as well.
In </span><cite class="ltx_cite ltx_citemacro_cite"><span class="ltx_text" style="font-size:90%;">[</span><a href="#bib.bib147" title="Context-aware stand-alone neural spelling correction" class="ltx_ref">19</a><span class="ltx_text" style="font-size:90%;">]</span></cite><span class="ltx_text" style="font-size:90%;"> the authors use a neural model to determine the context
of the word, resulting in a better guess as to which word the author
intended to use.</span></p>
</div>
<div id="S2.p6" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">One of our central themes in this article is the use and importance of a
public error corpus and/or list; an elaborate model for ordering correction
candidates: cf. </span><cite class="ltx_cite ltx_citemacro_cite"><span class="ltx_text" style="font-size:90%;">[</span><a href="#bib.bib81" title="A benchmark corpus of english misspellings and a minimally-supervised model for spelling correction" class="ltx_ref">10</a><span class="ltx_text" style="font-size:90%;">]</span></cite><span class="ltx_text" style="font-size:90%;">. Different sources have different
types of errors, thus different strategies should be used, and different
recall-precision figures are expected: </span><cite class="ltx_cite ltx_citemacro_cite"><span class="ltx_text" style="font-size:90%;">[</span><a href="#bib.bib31" title="Detecting and correcting spelling errors in high-quality dutch wikipedia text" class="ltx_ref">3</a><span class="ltx_text" style="font-size:90%;">]</span></cite><span class="ltx_text" style="font-size:90%;">.</span></p>
</div>
<div id="S2.p7" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">The GiellaLT framework </span><cite class="ltx_cite ltx_citemacro_cite"><span class="ltx_text" style="font-size:90%;">[</span><a href="#bib.bib191" title="Building an open-source development infrastructure for language technology projects" class="ltx_ref">21</a><span class="ltx_text" style="font-size:90%;">]</span></cite><span class="ltx_text" style="font-size:90%;"> originated from the initial
work on proofing tools and morphological analysers for the Sámi languages, where
Trond Trosterud has been a major driving force (see
e.g. </span><cite class="ltx_cite ltx_citemacro_cite"><span class="ltx_text" style="font-size:90%;">[</span><a href="#bib.bib284" title="Samisk språkteknologi" class="ltx_ref">22</a><span class="ltx_text" style="font-size:90%;">]</span></cite><span class="ltx_text" style="font-size:90%;"> and </span><cite class="ltx_cite ltx_citemacro_cite"><span class="ltx_text" style="font-size:90%;">[</span><a href="#bib.bib286" title="Disambiguering av homonymi i nord- og lulesamisk" class="ltx_ref">30</a><span class="ltx_text" style="font-size:90%;">]</span></cite><span class="ltx_text" style="font-size:90%;">). The
framework itself is language independent, but favours rule-based technologies
suitable for morphologically rich, complex, and low-resource languages. The overall
goal is to support all language technology needs of indigenous and minority
</span><span class="ltx_text" style="font-size:90%;">languages, from text input to speech technology. It is constantly being
developed, and is the home for keyboards for 50 languages, and language models
for more than 130 languages. Many languages and keyboards are in daily use, and
are core to the digital life of several indigenous and minority language
communities.</span></p>
</div>
</section>
<section id="S3" class="ltx_section">
<h2 class="ltx_title ltx_title_section" style="font-size:90%;">
<span class="ltx_tag ltx_tag_section">3 </span>Methods</h2>
<div id="S3.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">In this article we study two approaches to error-correction, a rule-based method
using two-level </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">finite-state transducers</span><span class="ltx_text" style="font-size:90%;">
(FST) </span><cite class="ltx_cite ltx_citemacro_cite"><span class="ltx_text" style="font-size:90%;">[</span><a href="#bib.bib231" title="State-of-the-art in weighted finite-state spell-checking" class="ltx_ref">26</a><span class="ltx_text" style="font-size:90%;">]</span></cite><span class="ltx_text" style="font-size:90%;">, and data-driven </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">neural network-based</span><span class="ltx_text" style="font-size:90%;">
(NN) </span><cite class="ltx_cite ltx_citemacro_cite"><span class="ltx_text" style="font-size:90%;">[</span><a href="#bib.bib342" title="Long short-term memory" class="ltx_ref">14</a>, <a href="#bib.bib44" title="Improving historical spelling normalization with bi-directional LSTMs and multi-task learning" class="ltx_ref">7</a><span class="ltx_text" style="font-size:90%;">]</span></cite><span class="ltx_text" style="font-size:90%;"> </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">language
models</span><span class="ltx_text" style="font-size:90%;">. We call a method that corrects incorrect word-forms into correct ones
</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">an error model</span><span class="ltx_text" style="font-size:90%;">.</span></p>
</div>
<section id="S3.SS1" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection" style="font-size:90%;">
<span class="ltx_tag ltx_tag_subsection">3.1 </span>FST methods</h3>
<div id="S3.SS1.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">The finite-state spelling correction follows the model described
in </span><cite class="ltx_cite ltx_citemacro_cite"><span class="ltx_text" style="font-size:90%;">[</span><a href="#bib.bib224" title="Finite-state spell-checking with weighted language and error models" class="ltx_ref">25</a><span class="ltx_text" style="font-size:90%;">]</span></cite><span class="ltx_text" style="font-size:90%;">: a transducer that modifies the erroneous
string is composed with the speller transducer, which accepts only valid
wordforms. As a result, the suggestion transducer presents only modifications
that are also valid wordforms to the user.</span></p>
</div>
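<div class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">The composition described above can be illustrated without any finite-state machinery. The following is a minimal Python sketch, not the implementation used in this work: the set of strings at edit distance one stands in for the error model, and intersecting it with a word list stands in for composition with the speller transducer. The alphabet, the toy lexicon and the function names are assumptions made for illustration only.</span></p>
</div>

```python
# Illustrative alphabet (an assumption, not taken from the article).
ALPHABET = "abdefghijklmnoprstuvõäöü"


def edits1(word, alphabet=ALPHABET):
    """Return all strings at edit distance one from word (toy error model)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + swaps + replaces + inserts) - {word}


def suggest(word, lexicon):
    """Keep only the modifications that the speller lexicon accepts."""
    return sorted(edits1(word) & lexicon)


# Toy lexicon standing in for the speller transducer.
LEXICON = {"kool", "kooli", "koos", "kohv"}

print(suggest("kooll", LEXICON))
```

<div class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">A real finite-state implementation never enumerates the candidate set explicitly; the sketch only shows why the user sees nothing but valid wordforms.</span></p>
</div>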
<div id="S3.SS1.p2" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">Ideally, there would be only one suggestion, and this would be the right one.
The more suggestions there are, and the lower down the ranked list the correct
</span><span class="ltx_text" style="font-size:90%;">one is, the worse for the user; and the worst case is a long list of suggestions
without the correct one amongst them. So the suggestion transducer has a dual
goal: keep the number of the suggestions low, and rank them correctly. One may
ask whether it is better to provide no suggestion at all than to present the
correct one ranked as 9th, for example. Presently, we have no answer to this
question. What number of suggestions and what way of ranking are psychologically
comfortable is a question for future user studies; for now we merely note that
this aspect has to be taken into consideration.</span></p>
</div>
<div id="S3.SS1.p3" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">Limiting the number of suggestions can be achieved by either allowing fewer
modifications of the erroneous form, limiting the recognizable vocabulary of the
speller, or both. As an example: fewer modifications might mean that only edit
distance one is allowed, and limited speller vocabulary might mean that only
simplex words are allowed, while productively formed compounds are prohibited as
suggested corrections</span><span id="footnote6" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">6</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">6</sup>
<span class="ltx_tag ltx_tag_note"><span class="ltx_text" style="font-size:90%;">6</span></span>
<span class="ltx_text" style="font-size:90%;">They would still be accepted by the speller. The
core idea is that one can use two different transducers or automata for the
speller: one to verify the text, including productive morphology, and another,
more restricted transducer, to verify suggestions.</span></span></span></span><span class="ltx_text" style="font-size:90%;">.</span></p>
</div>
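<div class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">The idea of using one acceptor for verifying text and a more restricted one for verifying suggestions can be sketched as follows. This is a simplified illustration under stated assumptions: compounding is modelled as plain concatenation of simplex words, and the word list and function names are hypothetical.</span></p>
</div>

```python
# Toy list of simplex wordforms (an assumption for illustration).
SIMPLEX = {"kool", "kooli", "maja", "tee"}


def splits_into_simplex(word, simplex=SIMPLEX, min_parts=1):
    """True if word is a concatenation of at least min_parts simplex words."""
    if not word:
        return min_parts <= 0
    return any(
        word.startswith(part)
        and splits_into_simplex(word[len(part):], simplex, min_parts - 1)
        for part in simplex
    )


def accepted_in_text(word):
    """Verification speller: simplex words and productive compounds."""
    return splits_into_simplex(word)


def accepted_as_suggestion(word):
    """Restricted speller: only simplex words may be suggested."""
    return word in SIMPLEX


candidates = ["koolimaja", "kooli", "koolx"]
print([c for c in candidates if accepted_in_text(c)])
print([c for c in candidates if accepted_as_suggestion(c)])
```

<div class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">In this toy setting the compound is accepted when it appears in running text but is never offered as a correction, which shortens the suggestion list.</span></p>
</div>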
<div id="S3.SS1.p4" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">With weighted transducers, we may attach different weights to different edit
operations and recognized wordforms. For example, interchanging </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">d</span><span class="ltx_text" style="font-size:90%;"> with
</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">t</span><span class="ltx_text" style="font-size:90%;"> adds a certain weight, and every component of a compound word adds
another weight. Suggestion ranking will follow from adding up all these weights,
and limiting their number may be based on cutting the list either above some
absolute weight, or above some absolute number of candidates. However, it is not
</span><span class="ltx_text" style="font-size:90%;">obvious how one should determine the right final weights and cutting points.
This article concentrates on modifications of the erroneous wordforms: what kind
of modifications should be made, and whether we can argue for attaching certain
weights to these modifications, in order to signal their likelihood.</span></p>
</div>
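<div class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">A small sketch may make the weighting scheme concrete: each candidate's weight is the sum of its edit weights and a per-component compound weight, candidates are sorted by that sum, and the list is cut either at a maximum weight or at a maximum length. All numeric weights and candidate names below are hypothetical placeholders; the article does not prescribe these values.</span></p>
</div>

```python
# Hypothetical weights; cheaper operations are considered more likely.
EDIT_WEIGHTS = {("d", "t"): 2.0, ("t", "d"): 2.0}
DEFAULT_EDIT_WEIGHT = 10.0
COMPOUND_PART_WEIGHT = 5.0


def candidate_weight(edits, n_parts):
    """Sum the weights of all edit operations and of all compound parts."""
    edit_total = sum(EDIT_WEIGHTS.get(e, DEFAULT_EDIT_WEIGHT) for e in edits)
    return edit_total + COMPOUND_PART_WEIGHT * n_parts


def rank(candidates, max_weight=None, max_count=None):
    """Sort candidates by total weight, then apply the optional cut-offs."""
    scored = sorted(
        (candidate_weight(edits, parts), word)
        for word, edits, parts in candidates
    )
    if max_weight is not None:
        scored = [(w, word) for w, word in scored if w <= max_weight]
    if max_count is not None:
        scored = scored[:max_count]
    return [word for _, word in scored]


# (candidate, edit operations applied, number of compound parts)
CANDIDATES = [
    ("cand_a", [("d", "t")], 1),  # 2 + 5 = 7
    ("cand_b", [("k", "g")], 1),  # 10 + 5 = 15
    ("cand_c", [("d", "t")], 2),  # 2 + 10 = 12
]

print(rank(CANDIDATES))
print(rank(CANDIDATES, max_weight=12.0))
print(rank(CANDIDATES, max_count=1))
```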
<div id="S3.SS1.p5" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">Weights from the speller lexicon are also used: if two candidates result from
modifications with the same weight, then the one that gets the smaller weight from
the speller is ranked first. We achieve this by having the modification weights
exceed the speller weights by a large margin; it is the modification that
matters most, not the likelihood of the wordform itself. The speller lexicon
weights are based partly on word frequency, either observed in a corpus or estimated by
linguistic intuition, and partly on the expert-decided likelihood of the
morphological tags; more elaborate weighting schemes can be imagined, but that
is outside the scope of this article.</span></p>
</div>
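<div class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">The effect of letting modification weights exceed lexicon weights by a large margin can be shown with simple arithmetic. In the sketch below, which uses assumed constants rather than values from our models, one edit costs more than the largest possible lexicon weight, so the lexicon weight can only break ties between candidates with the same number of edits.</span></p>
</div>

```python
# Assumed constants: one edit always outweighs any lexicon weight.
MAX_LEXICON_WEIGHT = 50.0
EDIT_UNIT = 100.0


def total_weight(n_edits, lexicon_weight):
    """Combine the modification weight with the speller lexicon weight."""
    if not 0.0 <= lexicon_weight <= MAX_LEXICON_WEIGHT:
        raise ValueError("lexicon weight outside the assumed range")
    return n_edits * EDIT_UNIT + lexicon_weight


# candidate: (number of edits, lexicon weight; lower means more frequent)
CANDIDATES = {
    "common_word": (1, 3.0),
    "rare_word": (1, 40.0),
    "far_word": (2, 1.0),
}

ranked = sorted(CANDIDATES, key=lambda w: total_weight(*CANDIDATES[w]))
print(ranked)
```

<div class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">Here the very frequent candidate that needs two edits still ranks below both one-edit candidates, while frequency decides the order among the latter.</span></p>
</div>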
</section>
<section id="S3.SS2" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection" style="font-size:90%;">
<span class="ltx_tag ltx_tag_subsection">3.2 </span>NN methods</h3>
<div id="S3.SS2.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">For neural error correction modelling, we are using a neural machine translation
approach. Within the neural machine translation framework, we use the
incorrectly written word-forms as source language, and the corrected word-forms
as target language. This logic allows us to train an error correction model with
an off-the-shelf neural machine translation toolkit. For this experiment we are
using OpenNMT-py</span><span id="footnote7" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">7</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">7</sup>
<span class="ltx_tag ltx_tag_note"><span class="ltx_text" style="font-size:90%;">7</span></span>
<a href="https://opennmt.net/OpenNMT-py" title="" class="ltx_ref ltx_url ltx_font_typewriter" style="font-size:90%;">https://opennmt.net/OpenNMT-py</a></span></span></span><span class="ltx_text" style="font-size:90%;"> </span><cite class="ltx_cite ltx_citemacro_cite"><span class="ltx_text" style="font-size:90%;">[</span><a href="#bib.bib131" title="OpenNMT: open-source toolkit for neural machine translation" class="ltx_ref">16</a><span class="ltx_text" style="font-size:90%;">]</span></cite><span class="ltx_text" style="font-size:90%;">
in its default settings, i.e. a translation model following the OpenNMT tutorial
</span><span class="ltx_text" style="font-size:90%;">on their website</span><span id="footnote8" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">8</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">8</sup>
<span class="ltx_tag ltx_tag_note"><span class="ltx_text" style="font-size:90%;">8</span></span>
<a href="https://opennmt.net/OpenNMT-py/quickstart.html" title="" class="ltx_ref ltx_url ltx_font_typewriter" style="font-size:90%;">https://opennmt.net/OpenNMT-py/quickstart.html</a></span></span></span><span class="ltx_text" style="font-size:90%;">.</span></p>
</div>
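<div class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">Casting error correction as translation mainly requires turning a misspelling list into parallel source and target data. The sketch below shows one plausible way to do this, assuming that each word-form is treated as a "sentence" whose tokens are its characters; this tokenisation, the example pair and the function names are our illustration and are not specified above. The resulting lines could be written to two plain-text files, one word-form per line, which is the kind of parallel input that neural machine translation toolkits commonly expect.</span></p>
</div>

```python
def to_char_tokens(word):
    """Represent a word-form as space-separated character tokens."""
    return " ".join(word)


def make_parallel(pairs):
    """Turn (misspelling, correction) pairs into source and target lines."""
    src_lines = [to_char_tokens(wrong) for wrong, _ in pairs]
    tgt_lines = [to_char_tokens(right) for _, right in pairs]
    return src_lines, tgt_lines


# Hypothetical misspelling list entry: (incorrect form, corrected form).
PAIRS = [("kooll", "kool")]

src, tgt = make_parallel(PAIRS)
print(src)
print(tgt)
```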
<div id="S3.SS2.p2" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">To limit the creativity of neural suggestions, we restrict the corrections to
word-forms that are accepted by the dictionary of the rule-based
spell-checker. That is, we take the list of </span><math id="S3.SS2.p2.m1" class="ltx_Math" alttext="n" display="inline"><mi mathsize="90%">n</mi></math><span class="ltx_text" style="font-size:90%;">-best translations from
OpenNMT-py and check it against the speller lexicon. Only the suggestions
accepted by the speller are included in the final suggestion list.</span></p>
</div>
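<div class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">The filtering step can be sketched in a few lines. The example assumes that the n-best hypotheses arrive as strings of space-separated characters, ordered from best to worst, and that the speller lexicon can be queried like a set; both are simplifications for illustration, and the function name is hypothetical.</span></p>
</div>

```python
def filter_nbest(nbest, lexicon):
    """Keep, in ranked order, the hypotheses that the speller accepts."""
    seen = set()
    kept = []
    for hypothesis in nbest:
        word = hypothesis.replace(" ", "")  # undo character tokenisation
        if word in lexicon and word not in seen:
            seen.add(word)
            kept.append(word)
    return kept


# Toy stand-ins for the translation output and the speller lexicon.
NBEST = ["k o o l", "k o l l", "k o o l i", "k o o l"]
LEXICON = {"kool", "kooli"}

print(filter_nbest(NBEST, LEXICON))
```

<div class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">If none of the hypotheses is accepted, the list is empty and no suggestion is shown, in line with the preference for precision stated in the introduction.</span></p>
</div>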
</section>
</section>
<section id="S4" class="ltx_section">
<h2 class="ltx_title ltx_title_section" style="font-size:90%;">
<span class="ltx_tag ltx_tag_section">4 </span>Lists of misspellings</h2>
<div id="S4.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">It is a truism that texts differ, depending on who creates them, for what
purpose and for what readership. Likewise, it is only natural to expect that the
errors made while writing depend on various factors. We are aware that the
misspelling lists we have at hand are not representative of the “general text
class” created by an “average writer”; so, in order to remain cautious when
interpreting our results, here are the main characteristics of the corpora that
these lists are derived from.</span></p>
</div>
<section id="S4.SS1" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection" style="font-size:90%;">
<span class="ltx_tag ltx_tag_subsection">4.1 </span>North Sámi</h3>
<div id="S4.SS1.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">The present-day North Sámi orthography dates from 1979, with some minor
adjustments from 1985</span><span id="footnote9" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">9</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">9</sup>
<span class="ltx_tag ltx_tag_note"><span class="ltx_text" style="font-size:90%;">9</span></span>
<span class="ltx_text" style="font-size:90%;">There have been several older orthographies going
back all the way to 1748.</span></span></span></span><span class="ltx_text" style="font-size:90%;">. The present orthography is thoroughly described
in </span><cite class="ltx_cite ltx_citemacro_cite"><span class="ltx_text" style="font-size:90%;">[</span><a href="#bib.bib202" title="Nordsamisk grammatikk" class="ltx_ref">23</a><span class="ltx_text" style="font-size:90%;">]</span></cite><span class="ltx_text" style="font-size:90%;">.
</span></p>
</div>
<div id="S4.SS1.p2" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">As a result of the Norwegian assimilation policy towards the Sámi people
throughout a major part of the 20th century, most texts written
in the modern orthography are fairly recent. Modern North Sámi literacy is
correspondingly young, which is reflected in texts in the form of spelling and
other grammatical errors. In the material used
in </span><cite class="ltx_cite ltx_citemacro_cite"><span class="ltx_text" style="font-size:90%;">[</span><a href="#bib.bib13" title="Cállinmeattáhusaid guorran." class="ltx_ref">1</a><span class="ltx_text" style="font-size:90%;">]</span></cite><span class="ltx_text" style="font-size:90%;"> there are about 4% spelling errors,
which is considerably more than in e.g. Norwegian or English texts produced by
native speakers. In </span><cite class="ltx_cite ltx_citemacro_cite"><span class="ltx_text" style="font-size:90%;">[</span><a href="#bib.bib82" title="Patterns of misspellings in L2 and L1 English: a view from the ETS Spelling Corpus" class="ltx_ref">11</a><span class="ltx_text" style="font-size:90%;">]</span></cite><span class="ltx_text" style="font-size:90%;">, where the majority of the texts are
written by non-native speakers of English at various levels of proficiency,
the average rate of spelling errors is 2.74%. For the most
advanced writers contributing to the data set, the average rate of
misspellings is well below 1%. That is, the average rate of spelling errors
in North Sámi texts is considerably higher than in similar English texts. This
is expected given the short history of the orthography, the sociolinguistic
setting, the paucity of available text and thus written language exposure, and
the minority language status of North Sámi.</span></p>
</div>
<div id="S4.SS1.p3" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">The material used in developing, testing and evaluating the error models in this
paper has been collected over many years while developing various language
technology tools for North Sámi.</span><span id="footnote10" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">10</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">10</sup>
<span class="ltx_tag ltx_tag_note"><span class="ltx_text" style="font-size:90%;">10</span></span>
<span class="ltx_text" style="font-size:90%;">Source code at:
</span><a href="https://github.com/giellalt/lang-sme" title="" class="ltx_ref ltx_url ltx_font_typewriter" style="font-size:90%;">https://github.com/giellalt/lang-sme</a></span></span></span><span class="ltx_text" style="font-size:90%;"> Misspellings found in texts have
been collected in a separate text file, together with the expected correction
(usually based on the incorrect word form itself, sometimes also considering the
context where the misspelling was found). At the time of writing, the list of
</span><span class="ltx_text" style="font-size:90%;">typos contains 11 706 entries. Since the focus of research described here is
evaluating and developing error models, the list was filtered by removing
multiword expressions, false negatives</span><span id="footnote11" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">11</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">11</sup>
<span class="ltx_tag ltx_tag_note"><span class="ltx_text" style="font-size:90%;">11</span></span>
<span class="ltx_text" style="font-size:90%;">misspellings accepted by the
speller as valid words.</span></span></span></span><span class="ltx_text" style="font-size:90%;">, and entries for which the given correction was not
recognized by the speller. The filtered list consists of 10 745 entries.</span></p>
</div>
<div id="S4.SS1.p4" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">Given the development history of the list of typos, the source texts for the
misspellings can be assumed to be all sorts of texts, the majority of which are
found in SIKOR</span><span id="footnote12" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">12</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">12</sup>
<span class="ltx_tag ltx_tag_note"><span class="ltx_text" style="font-size:90%;">12</span></span>
<cite class="ltx_cite ltx_citemacro_cite"><span class="ltx_text" style="font-size:90%;">[</span><a href="#bib.bib353" title="SIKOR uit norgga árktalaš universitehta ja norgga sámedikki sámi teakstačoakkáldat, veršuvdna 06.11.2018" class="ltx_ref">27</a><span class="ltx_text" style="font-size:90%;">]</span></cite></span></span></span><span class="ltx_text" style="font-size:90%;">. That is, the collection of
typos can be considered relatively representative of errors made by North Sámi
writers across various genres.</span></p>
</div>
<div id="S4.SS1.p5" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">For the machine learning experiment, the list was split in three according to
the usual 80-10-10 scheme: 80% for training, and 10% each for testing and development
/ validation. For the regular expression experiment, no such split was used, and
the list was used both to inform the developers about useful patterns and to
evaluate the resulting error model.</span></p>
</div>
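<div class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">An 80-10-10 split of a list of typo-correction pairs can be sketched as follows. This is a minimal illustration, not the procedure actually used: the function name <span class="ltx_text ltx_font_typewriter">split_80_10_10</span>, the seeded shuffle and the handling of the remainder (it goes to the test part) are all assumptions.</span></p>
<pre class="ltx_verbatim ltx_font_typewriter" style="font-size:90%;">
```python
import random


def split_80_10_10(pairs, seed=0):
    """Split (typo, correction) pairs into train, development and test parts.

    The list is shuffled with a fixed seed (an assumption of this sketch)
    so that the split is reproducible. Integer arithmetic is used for the
    part sizes; any remainder ends up in the test part.
    """
    items = list(pairs)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_dev = len(items) // 10
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test


# Placeholder data with the same size as the filtered North Sámi list.
placeholder_pairs = [("typo%d" % i, "fix%d" % i) for i in range(10745)]
train, dev, test = split_80_10_10(placeholder_pairs)
print(len(train), len(dev), len(test))
```
</pre>
</div>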
</section>
<section id="S4.SS2" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection" style="font-size:90%;">
<span class="ltx_tag ltx_tag_subsection">4.2 </span>South Sámi</h3>
<div id="S4.SS2.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">The present-day South Sámi orthography was formally decided upon in 1978,
although </span><cite class="ltx_cite ltx_citemacro_cite"><span class="ltx_text" style="font-size:90%;">[</span><a href="#bib.bib50" title="Lohkede saemien. sørsamisk lesebok" class="ltx_ref">8</a><span class="ltx_text" style="font-size:90%;">]</span></cite><span class="ltx_text" style="font-size:90%;"> used an early version of that orthography.
South Sámi differs from most other Sámi languages and dialects in having a vast and
complex system of umlaut, cf. </span><cite class="ltx_cite ltx_citemacro_cite"><span class="ltx_text" style="font-size:90%;">[</span><a href="#bib.bib35" title="Sydsamisk grammatikk" class="ltx_ref">5</a><span class="ltx_text" style="font-size:90%;">]</span></cite><span class="ltx_text" style="font-size:90%;">
and </span><cite class="ltx_cite ltx_citemacro_cite"><span class="ltx_text" style="font-size:90%;">[</span><a href="#bib.bib161" title="Sørsamisk grammatikk" class="ltx_ref">20</a><span class="ltx_text" style="font-size:90%;">]</span></cite><span class="ltx_text" style="font-size:90%;">. Although South Sámi, unlike the other Sámi languages, does not have consonant
</span><span class="ltx_text" style="font-size:90%;">gradation, it does have alternations in
consonant clusters and surrounding vowels depending on the syllable and foot
structure of the word. Various inflectional endings add zero, one or more
syllables to the base form, which forces a recast of the foot structure, which
can set off a chain reaction of various consonant and vowel changes. Two
examples:</span></p>
</div>
<div id="S4.SS2.p2" class="ltx_para">
<p class="ltx_p"><span class="ltx_text ltx_font_italic" style="font-size:90%;">gåetie gåatan gåatetje gåatatjasse</span><br class="ltx_break">
<span class="ltx_text ltx_font_typewriter" style="font-size:90%;">gåetie+N+Sg+Nom gåetie+N+Sg+Ill gåetie+Dimin+N+Sg+Nom gåetie+Dimin+N+Sg+Ill</span><br class="ltx_break">
<span class="ltx_text" style="font-size:90%;">‘House, into the house, little house, into the little house’</span><br class="ltx_break">
<span class="ltx_text ltx_font_italic" style="font-size:90%;">åeruve åerievasse åerievadtje åerievadtjese</span><br class="ltx_break">
<span class="ltx_text ltx_font_typewriter" style="font-size:90%;">åeruve+N+Sg+Nom åeruve+N+Sg+Ill åeruve+Dimin+N+Sg+Nom åeruve+Dimin+N+Sg+Ill</span><br class="ltx_break">
<span class="ltx_text" style="font-size:90%;">‘Squirrel, into the squirrel, little squirrel, into the little squirrel’</span></p>
</div>
<div id="S4.SS2.p3" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">That is, the vowel of the second and third syllables changes as follows:
</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">-ie-, -a-, -e-, -a-</span><span class="ltx_text" style="font-size:90%;"> for </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">gåetie</span><span class="ltx_text" style="font-size:90%;">, and </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">-u- + -e-, -ie- +
-a-</span><span class="ltx_text" style="font-size:90%;"> for </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">åeruve</span><span class="ltx_text" style="font-size:90%;">. The default illative case ending has two forms:
</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">-asse</span><span class="ltx_text" style="font-size:90%;"> and </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">-ese</span><span class="ltx_text" style="font-size:90%;">, and the diminutive derivation also has two
forms: </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">-etje</span><span class="ltx_text" style="font-size:90%;"> and </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">-adtje</span><span class="ltx_text" style="font-size:90%;">. The forms of the suffixes (illative
and diminutive in the </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">gåetie</span><span class="ltx_text" style="font-size:90%;"> example above) depend solely on the syllable
count, whereas some vowel changes also depend on the stem type. The umlaut of
the root vowel is triggered by the underlying vowel of both case and
derivational suffixes.</span></p>
</div>
<div id="S4.SS2.p4" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">The South Sámi language community is just a fraction of the size of the North Sámi one, with
correspondingly less production of and exposure to the written language. Also, a
considerable portion of the population are in practice L2 speakers. This is
reflected in the misspelling list used for testing as a number of errors
relating to mixing up vowels and inflectional endings, essentially miscounting the
syllables and thus applying the wrong suffix; an example of this taken from the
list can be seen in the example below, which also contains other errors, such as using
</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">ø</span><span class="ltx_text" style="font-size:90%;"> for correct </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">ö</span><span class="ltx_text" style="font-size:90%;">, and mixing </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">s</span><span class="ltx_text" style="font-size:90%;"> and </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">sj</span><span class="ltx_text" style="font-size:90%;">.
Identifying every such case reliably is not trivial; determining the
proportion of these errors relative to the rest is left as a topic for future research.</span></p>
</div>
<div id="S4.SS2.p5" class="ltx_para">
<p class="ltx_p"><span class="ltx_text ltx_font_italic" style="font-size:90%;">*Vyøhkesadtibie</span><br class="ltx_break">
<span class="ltx_text ltx_font_typewriter" style="font-size:90%;">vyöhkesjadtedh+V+IV+Ind+Prs+Pl1</span><br class="ltx_break">
<span class="ltx_text" style="font-size:90%;">‘We help each other’ (wrong syllabification and thus suffix form)</span><br class="ltx_break">
<span class="ltx_text ltx_font_italic" style="font-size:90%;">Vyöhkesjadtebe</span><br class="ltx_break">
<span class="ltx_text ltx_font_typewriter" style="font-size:90%;">vyöhkesjadtedh+V+IV+Ind+Prs+Pl1</span><br class="ltx_break">
<span class="ltx_text" style="font-size:90%;">‘We help each other’ (correct syllabification and suffix)</span></p>
</div>
<div id="S4.SS2.p6" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">Identifying the syllabic structure is not made easier by historical processes
leading to exceptions, so that instead of the regular pattern </span><math id="S4.SS2.p6.m1" class="ltx_Math" alttext="2+2+\cdots n\cdots+2/3" display="inline"><mrow><mn mathsize="90%">2</mn><mo mathsize="90%" stretchy="false">+</mo><mn mathsize="90%">2</mn><mo mathsize="90%" stretchy="false">+</mo><mrow><mi mathsize="90%" mathvariant="normal">⋯</mi><mo></mo><mi mathsize="90%">n</mi><mo></mo><mi mathsize="90%" mathvariant="normal">⋯</mi></mrow><mo mathsize="90%" stretchy="false">+</mo><mrow><mn mathsize="90%">2</mn><mo mathsize="90%" stretchy="false">/</mo><mn mathsize="90%">3</mn></mrow></mrow></math><span class="ltx_text" style="font-size:90%;">, you get </span><math id="S4.SS2.p6.m2" class="ltx_Math" alttext="3+2" display="inline"><mrow><mn mathsize="90%">3</mn><mo mathsize="90%" stretchy="false">+</mo><mn mathsize="90%">2</mn></mrow></math><span class="ltx_text" style="font-size:90%;">, or </span><math id="S4.SS2.p6.m3" class="ltx_Math" alttext="2+1" display="inline"><mrow><mn mathsize="90%">2</mn><mo mathsize="90%" stretchy="false">+</mo><mn mathsize="90%">1</mn></mrow></math><span class="ltx_text" style="font-size:90%;">, instead of the expected </span><math id="S4.SS2.p6.m4" class="ltx_Math" alttext="2+3" display="inline"><mrow><mn mathsize="90%">2</mn><mo mathsize="90%" stretchy="false">+</mo><mn mathsize="90%">3</mn></mrow></math><span class="ltx_text" style="font-size:90%;">, and </span><math id="S4.SS2.p6.m5" class="ltx_Math" alttext="3" display="inline"><mn mathsize="90%">3</mn></math><span class="ltx_text" style="font-size:90%;">.
Examples of these can be seen in (</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">next example</span><span class="ltx_text" style="font-size:90%;">).</span></p>
</div>
<div id="S4.SS2.p7" class="ltx_para">
<p class="ltx_p"><span class="ltx_text ltx_font_italic" style="font-size:90%;">dåerie•dieh</span><br class="ltx_break">
<span class="ltx_text ltx_font_typewriter" style="font-size:90%;">dåeriedidh+V+TV+Ind+Prs+Pl3</span><br class="ltx_break">
<span class="ltx_text" style="font-size:90%;">‘They are following’ (syllable structure: 2 + 1)</span><br class="ltx_break">
<span class="ltx_text ltx_font_italic" style="font-size:90%;">dåerede•minie</span><br class="ltx_break">
<span class="ltx_text ltx_font_typewriter" style="font-size:90%;">dåeriedidh+V+TV+Ger</span><br class="ltx_break">
<span class="ltx_text" style="font-size:90%;">‘(In the process of) following’ (syllable structure: 3 + 2)</span></p>
</div>
<div id="S4.SS2.p8" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">Complicating the issue further are loan words: how should their syllables be
counted and fit into the foot structure of South Sámi phonotactics? An example
of this can be seen in (</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">next example</span><span class="ltx_text" style="font-size:90%;">), with the misspelled form in (</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">next example</span><span class="ltx_text" style="font-size:90%;">a), and the
correct form in (</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">next example</span><span class="ltx_text" style="font-size:90%;">b). The misspelling of the case
suffix is clearly caused by applying the wrong foot structure to the word form.</span></p>
</div>
<div id="S4.SS2.p9" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">(a) </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">Wikipe•dij:ese</span><br class="ltx_break">
<span class="ltx_text ltx_font_typewriter" style="font-size:90%;">wikipedije+N+Sg+Ill</span><br class="ltx_break">
<span class="ltx_text" style="font-size:90%;">‘Into Wikipedia’ (wrong syllable structure: 3 + 3, and thus wrong suffix form)</span><br class="ltx_break">
<span class="ltx_text" style="font-size:90%;">(b) </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">Wiki•pedi•jasse</span><br class="ltx_break">
<span class="ltx_text ltx_font_typewriter" style="font-size:90%;">wikipedije+N+Sg+Ill</span><br class="ltx_break">
<span class="ltx_text" style="font-size:90%;">‘Into Wikipedia’ (correct syllable structure: 2 + 2 + 2)</span></p>
</div>
<div id="S4.SS2.p10" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">Finally, the South Sámi orthographic rules recommend using Norwegian
</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">æ</span><span class="ltx_text" style="font-size:90%;"> and Swedish </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">ö</span><span class="ltx_text" style="font-size:90%;">. Until recently, following these rules
required knowing how to produce the vowel letter from the other side of
the border, and it also required an extra key press: AltGr + the standard vowel.
In practice, most people did not bother, and the South Sámi list is full of
Norwegian </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">ø</span><span class="ltx_text" style="font-size:90%;">’s and Swedish </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">ä</span><span class="ltx_text" style="font-size:90%;">’s. These are considered
misspellings by the spelling checker, and they also contribute to the complexity
of correcting South Sámi. It is not uncommon to find spelling errors with an
edit distance of four or more; in the test list of typos, 48 such cases are
found, ≈4.2% of the corpus.</span></p>
</div>
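<div class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">To make the notion of edit distance concrete, the following minimal Python sketch computes the plain Levenshtein distance (insertions, deletions and substitutions, each with cost 1) and the share of typo-correction pairs at or above a given distance. Whether the figures reported here were computed with exactly this variant of edit distance is an assumption; the function names and the threshold parameter are likewise illustrative only.</span></p>
<pre class="ltx_verbatim ltx_font_typewriter" style="font-size:90%;">
```python
def levenshtein(a, b):
    """Plain Levenshtein distance between strings a and b (unit costs)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def share_of_distant_errors(pairs, threshold=4):
    """Fraction of (typo, correction) pairs with distance of at least threshold."""
    distant = sum(1 for typo, fix in pairs if levenshtein(typo, fix) >= threshold)
    return distant / len(pairs)


# The misspelling and correction from the example earlier in this section.
print(levenshtein("Vyøhkesadtibie", "Vyöhkesjadtebe"))
```
</pre>
</div>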
<div id="S4.SS2.p11" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">As was the case with North Sámi, the list of typos for South Sámi was collected
while developing the morphological analyser, based on material that is mostly
found in SIKOR (</span><cite class="ltx_cite ltx_citemacro_cite"><span class="ltx_text" style="font-size:90%;">[</span><a href="#bib.bib353" title="SIKOR uit norgga árktalaš universitehta ja norgga sámedikki sámi teakstačoakkáldat, veršuvdna 06.11.2018" class="ltx_ref">27</a><span class="ltx_text" style="font-size:90%;">]</span></cite><span class="ltx_text" style="font-size:90%;">). The cleaned version of that manually
built list mentioned above contains only 1 154 entries. A separate list of
typo-correction pairs was extracted from a manually marked up corpus of
gold-standard text. That token list contains 8 325 non-unique entries, and was
used for training a machine learning model, testing and evaluation, using the
common 80-10-10 split. This list, extracted from the gold standard corpus, was
not used when building the manually crafted regex error model.</span></p>
</div>
</section>
<section id="S4.SS3" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection" style="font-size:90%;">
<span class="ltx_tag ltx_tag_subsection">4.3 </span>Estonian</h3>
<div id="S4.SS3.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">Estonian orthography in its present form was adopted during the third quarter of
</span><span class="ltx_text" style="font-size:90%;">the 19th century. It is modelled after Finnish orthography; the proposal was
made by Adolf Ivar Arwidsson </span><cite class="ltx_cite ltx_citemacro_cite"><span class="ltx_text" style="font-size:90%;">[</span><a href="#bib.bib20" title="Ueber die ehstniche orthographie. won einem finnländer" class="ltx_ref">2</a><span class="ltx_text" style="font-size:90%;">]</span></cite><span class="ltx_text" style="font-size:90%;">. Prior to this,
Estonian orthography was modelled after High German, but uneducated Estonian
peasants spontaneously tended towards the Finnish-style orthography </span><cite class="ltx_cite ltx_citemacro_cite"><span class="ltx_text" style="font-size:90%;">[</span><a href="#bib.bib21" title="Eesti kirjakeele ajaloost" class="ltx_ref">15</a><span class="ltx_text" style="font-size:90%;">, p.
204]</span></cite><span class="ltx_text" style="font-size:90%;">.</span></p>
</div>
<div id="S4.SS3.p2" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">The main difference from the previous orthography lies in the simplicity of the
rules for marking phone length: nowadays, the rule of thumb is that a short
phone is marked by one letter, a long (and extra-long) phone by two letters, and
every consonant in a cluster is marked with one letter, even if it is pronounced
long or extra long. As an exception, </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">k</span><span class="ltx_text" style="font-size:90%;">, </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">p</span><span class="ltx_text" style="font-size:90%;"> and </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">t</span><span class="ltx_text" style="font-size:90%;"> are
written as </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">g</span><span class="ltx_text" style="font-size:90%;">, </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">b</span><span class="ltx_text" style="font-size:90%;">, </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">d</span><span class="ltx_text" style="font-size:90%;"> when short, </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">k</span><span class="ltx_text" style="font-size:90%;">,
</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">p</span><span class="ltx_text" style="font-size:90%;">, </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">t</span><span class="ltx_text" style="font-size:90%;"> when long, and </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">kk</span><span class="ltx_text" style="font-size:90%;">, </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">pp</span><span class="ltx_text" style="font-size:90%;">, </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">tt</span><span class="ltx_text" style="font-size:90%;"> when
extra long. Also, when adjacent to a nonsonorous consonant, </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">g</span><span class="ltx_text" style="font-size:90%;">,
</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">b</span><span class="ltx_text" style="font-size:90%;">, </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">d</span><span class="ltx_text" style="font-size:90%;"> are also written as </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">k</span><span class="ltx_text" style="font-size:90%;">, </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">p</span><span class="ltx_text" style="font-size:90%;">, </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">t</span><span class="ltx_text" style="font-size:90%;">.
In addition to indeterminacy in differentiating between long and extra long
phones (except for </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">k</span><span class="ltx_text" style="font-size:90%;">, </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">p</span><span class="ltx_text" style="font-size:90%;">, </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">t</span><span class="ltx_text" style="font-size:90%;">), and between short and
long ones in consonant clusters, palatalisation is also not marked. There have
been numerous proposals to improve the Estonian orthography, in order to make
it even more phonetic, e.g. by allowing double letters in consonant clusters,
and three letters for extra long phones, but these proposals have not been
adopted. Very precise perception and marking of phone lengths is difficult to
implement in practice, given the various co-articulation effects in real speech.</span></p>
</div>
<div id="S4.SS3.p3" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">In addition to the principle of phone length and letter correspondence, the
</span><span class="ltx_text" style="font-size:90%;">Estonian orthography also to some extent follows the principle of keeping the
traditional form of words (even if it deviates from the current pronunciation),
and the principle of retaining the form of morphemes while inflecting the word
</span><cite class="ltx_cite ltx_citemacro_cite"><span class="ltx_text" style="font-size:90%;">[</span><a href="#bib.bib75" title="Eesti keele käsiraamat" class="ltx_ref">9</a><span class="ltx_text" style="font-size:90%;">]</span></cite><span class="ltx_text" style="font-size:90%;">. Orthography errors tend to happen when these two additional
principles collide with the phonemic principle.</span></p>
</div>
<div id="S4.SS3.p4" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">The Estonian list of 3000 misspelled words originates from journalists’ texts.
About one third of it dates from the 1980s and 1990s: 1) a retyped Corpus of
Estonian Literary Language</span><span id="footnote13" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">13</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">13</sup>
<span class="ltx_tag ltx_tag_note"><span class="ltx_text" style="font-size:90%;">13</span></span>
<a href="https://www.cl.ut.ee" title="" class="ltx_ref ltx_url ltx_font_typewriter" style="font-size:90%;">https://www.cl.ut.ee</a></span></span></span><span class="ltx_text" style="font-size:90%;">, containing 1
million words from 1983–1988, and 2) texts from the news agency Baltic News
Service, from one month in 1996 (about 250 000 words). The errors were gathered
by running an Estonian morphological analyser on the corpus and then manually
picking misspellings from the set of unanalysed words (by Heili Orav and Leho
Paldre). The other two thirds date from the 2000s and 2010s, gathered by Kairit Sirts
from a newspaper corpus in an ad hoc manner, according to her own words.</span></p>
</div>
</section>
</section>
<section id="S5" class="ltx_section">
<h2 class="ltx_title ltx_title_section" style="font-size:90%;">
<span class="ltx_tag ltx_tag_section">5 </span>Error types</h2>
<div id="S5.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">An ideal error typology would reflect what went wrong in the chain of actions of
the writer, and/or what was the likely cause, not just count the edit
operations. However, we have not yet been able to reach this ideal. It seems,
though, that one potential distractor might be the current set of conventions for
writing the language, i.e. its orthography.</span></p>
</div>
<div id="S5.p2" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">The full list of registered typos was run through a semi-automatic
</span><span class="ltx_text" style="font-size:90%;">classification system, and tagged according to identified class. The resulting
classification combines edit distance with the character classes involved
and is summarized in Table </span><a href="#S5.T1" title="Table 1 ‣ 5 Error types ‣ You can’t suggest that?! Comparisons and improvements of speller error models" class="ltx_ref" style="font-size:90%;"><span class="ltx_text ltx_ref_tag"><span class="ltx_text" style="font-size:90%;">1</span></span></a><span class="ltx_text" style="font-size:90%;">. In cases where subclasses
are identified, the figures for those are listed to the left in each column, the
total to the right.</span></p>
</div>
<div id="S5.p3" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">Accented letter errors are easy to correct: there are very few alternatives one
should offer, and the reasoning behind the suggestions is transparent, making it
easy for the writer to decide whether to accept or not. An example for Estonian
would be </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">*tshempion</span><span class="ltx_text" style="font-size:90%;">—</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">tšempion</span><span class="ltx_text" style="font-size:90%;">. For North Sámi, this type of
error is very frequent: one third of misspellings belong to this class, and we
can even identify subclasses: vowel </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">á</span><span class="ltx_text" style="font-size:90%;"> vs </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">a</span><span class="ltx_text" style="font-size:90%;"> (e.g.
</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">*Amerihka</span><span class="ltx_text" style="font-size:90%;">—</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">Amerihká</span><span class="ltx_text" style="font-size:90%;">), or consonants </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">č</span><span class="ltx_text" style="font-size:90%;">, </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">đ</span><span class="ltx_text" style="font-size:90%;">,
</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">ŋ</span><span class="ltx_text" style="font-size:90%;">, </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">š</span><span class="ltx_text" style="font-size:90%;">, </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">ŧ</span><span class="ltx_text" style="font-size:90%;">, </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">ž</span><span class="ltx_text" style="font-size:90%;"> vs </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">c</span><span class="ltx_text" style="font-size:90%;">, </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">d</span><span class="ltx_text" style="font-size:90%;">,
</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">n</span><span class="ltx_text" style="font-size:90%;">, </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">s</span><span class="ltx_text" style="font-size:90%;">, </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">t</span><span class="ltx_text" style="font-size:90%;">, </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">z</span><span class="ltx_text" style="font-size:90%;"> (e.g.
</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">*Cuovvovaccat</span><span class="ltx_text" style="font-size:90%;">—</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">Čuovvovaččat</span><span class="ltx_text" style="font-size:90%;">,
</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">*Sámediggerádi</span><span class="ltx_text" style="font-size:90%;">—</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">Sámediggeráđi</span><span class="ltx_text" style="font-size:90%;">,
</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">*CD-singel</span><span class="ltx_text" style="font-size:90%;">—</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">CD-siŋgel</span><span class="ltx_text" style="font-size:90%;">, </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">*oktašas</span><span class="ltx_text" style="font-size:90%;">—</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">oktasaš</span><span class="ltx_text" style="font-size:90%;">,
</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">*olbmoŧ</span><span class="ltx_text" style="font-size:90%;">—</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">olbmot</span><span class="ltx_text" style="font-size:90%;">,
</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">*gazaldaga</span><span class="ltx_text" style="font-size:90%;">—</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">gažaldaga</span><span class="ltx_text" style="font-size:90%;">). In fact, </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">a</span><span class="ltx_text" style="font-size:90%;">—</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">á</span><span class="ltx_text" style="font-size:90%;">
confusion is the single most frequent spelling error in North Sámi texts, around
40% in general according to </span><cite class="ltx_cite ltx_citemacro_cite"><span class="ltx_text" style="font-size:90%;">[</span><a href="#bib.bib13" title="Cállinmeattáhusaid guorran." class="ltx_ref">1</a><span class="ltx_text" style="font-size:90%;">, p
24]</span></cite><span id="footnote14" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">14</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">14</sup>
<span class="ltx_tag ltx_tag_note"><span class="ltx_text" style="font-size:90%;">14</span></span>
<span class="ltx_text" style="font-size:90%;">She includes real-word errors,
which we do not; this probably explains the difference in relative size for
this error type in her investigation compared to our findings.</span></span></span></span><span class="ltx_text" style="font-size:90%;">. These
</span><span class="ltx_text" style="font-size:90%;">errors in North Sámi likely have several sources. One is lack of keyboard support
that makes it hard to type the correct letter. That was a major issue in the social
media texts investigated by Antonsen (op. cit.), but a North Sámi keyboard app
for mobile phones has been available for several years now, so this is less
of a problem today. Another possible source is uncertainty about the correct
spelling, often in combination with dialectal variation. The
</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">a</span><span class="ltx_text" style="font-size:90%;">-</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">á</span><span class="ltx_text" style="font-size:90%;"> confusion can at least partly be attributed to the fact
that the orthography does not follow the phonology in various dialects: the
variation is greater and more complex than the orthography reflects. Writing final
</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">ŧ</span><span class="ltx_text" style="font-size:90%;"> instead of final </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">t</span><span class="ltx_text" style="font-size:90%;"> is also most likely based on
pronunciation: in some dialects, the plosive </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">t</span><span class="ltx_text" style="font-size:90%;"> is reduced to a pure
fricative </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">h</span><span class="ltx_text" style="font-size:90%;"> sound when followed by a word beginning with a vowel. As
almost all misspellings of </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">ŧ</span><span class="ltx_text" style="font-size:90%;"> for correct </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">t</span><span class="ltx_text" style="font-size:90%;"> can be found
in this position, it is very likely that phonology plays a role. For a more
detailed analysis of spelling errors in North Sámi,
see </span><cite class="ltx_cite ltx_citemacro_cite"><span class="ltx_text" style="font-size:90%;">[</span><a href="#bib.bib13" title="Cállinmeattáhusaid guorran." class="ltx_ref">1</a><span class="ltx_text" style="font-size:90%;">]</span></cite><span class="ltx_text" style="font-size:90%;">.</span></p>
</div>
<div id="S5.p4" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">Accented letters in South Sámi cover only three pairs: </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">i</span><span class="ltx_text" style="font-size:90%;"> vs </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">ï</span><span class="ltx_text" style="font-size:90%;">
(e.g. </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">*jih</span><span class="ltx_text" style="font-size:90%;">—</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">jïh</span><span class="ltx_text" style="font-size:90%;">, </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">*hïjven</span><span class="ltx_text" style="font-size:90%;">—</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">hijven</span><span class="ltx_text" style="font-size:90%;">),
</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">*ø</span><span class="ltx_text" style="font-size:90%;"> vs </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">ö</span><span class="ltx_text" style="font-size:90%;"> (e.g. </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">*bøøremes</span><span class="ltx_text" style="font-size:90%;">—</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">bööremes</span><span class="ltx_text" style="font-size:90%;">), and
</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">*ä</span><span class="ltx_text" style="font-size:90%;"> vs </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">æ</span><span class="ltx_text" style="font-size:90%;"> (e.g. </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">*nännoste</span><span class="ltx_text" style="font-size:90%;">—</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">nænnoste</span><span class="ltx_text" style="font-size:90%;">). But
they account for more than half of all misspellings in our test data. Out of a total
set of 8 325 misspelling instances, 4 285 (51.5%) are errors of this
type. The conjunction </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">jïh</span><span class="ltx_text" style="font-size:90%;"> (=</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">and</span><span class="ltx_text" style="font-size:90%;">) alone accounts for more than
10% (884 occurrences) of all misspellings. The three pairs fall into two
</span><span class="ltx_text" style="font-size:90%;">categories, one purely orthographic, and one phonological. The </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">*ø/ö</span><span class="ltx_text" style="font-size:90%;"> and
</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">*ä/æ</span><span class="ltx_text" style="font-size:90%;"> pairs are purely orthographic: as South Sámi is spoken in both
Sweden and Norway, the idea is to make a compromise such that one sound is
written using a Swedish letter (</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">ö</span><span class="ltx_text" style="font-size:90%;">) and one using a Norwegian letter (
</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">æ</span><span class="ltx_text" style="font-size:90%;">). Due to the lack of a South Sámi keyboard, people have usually fallen
back on either a Norwegian or a Swedish keyboard, disregarding the
orthographic norm. In the case of </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">i</span><span class="ltx_text" style="font-size:90%;"> vs </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">ï</span><span class="ltx_text" style="font-size:90%;"> it is a real
phonological opposition, although the distinction was not made in early versions
of the South Sámi orthography. The distinction is also not clear to all
speakers.</span></p>
</div>
<div id="S5.p5" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">As seen above, the error type </span><span class="ltx_text ltx_font_bold" style="font-size:90%;">accented letters</span><span class="ltx_text" style="font-size:90%;"> is a heterogeneous class,
whose properties vary across the languages. It still makes sense to treat these errors
as one class with respect to modelling errors, as they stand out from other
misspellings both in frequency and, often, in simplicity of correction.</span></p>
</div>
<div id="S5.p6" class="ltx_para">
<p class="ltx_p"><span class="ltx_text ltx_font_bold" style="font-size:90%;">Deleting</span><span class="ltx_text" style="font-size:90%;"> (or omitting) a letter is a very frequent error. It may be
caused by failing to hit a key, or by misapplying a phone-to-letter mapping rule. A
suggestion to correct this error by doubling a letter, or (in case of North
Sámi) by creating a diphthong, e.g.
</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">*departementa</span><span class="ltx_text" style="font-size:90%;">—</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">departemeanta</span><span class="ltx_text" style="font-size:90%;">, might seem more plausible than
a suggestion to insert a letter in some random position of the same word. Thus,
it makes sense to identify this subclass of deletions.</span></p>
</div>
<div id="S5.p7" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">If the misspelling means that an extra letter has been </span><span class="ltx_text ltx_font_bold" style="font-size:90%;">added</span><span class="ltx_text" style="font-size:90%;">, we also
</span><span class="ltx_text" style="font-size:90%;">identify a subclass of resulting doubles or diphthongs, the classification thus
being similar to the deletion errors.</span></p>
</div>
<div id="S5.p8" class="ltx_para">
<p class="ltx_p"><span class="ltx_text ltx_font_bold" style="font-size:90%;">Substitution</span><span class="ltx_text" style="font-size:90%;"> errors are relatively more frequent in the Sámi corpora
than in Estonian. They also involve cases where one letter is substituted by two
(e.g. North Sámi </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">*direktora</span><span class="ltx_text" style="font-size:90%;">—</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">direktevra</span><span class="ltx_text" style="font-size:90%;">), or two by one (e.g.
North Sámi </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">*Osllu</span><span class="ltx_text" style="font-size:90%;">—</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">Oslo</span><span class="ltx_text" style="font-size:90%;">), or two adjacent letters by two
different ones, as in consonant gradation mix-ups (e.g.
</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">*Sámedikkeválgii</span><span class="ltx_text" style="font-size:90%;">—</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">Sámediggeválgii</span><span class="ltx_text" style="font-size:90%;">).</span></p>
</div>
<div id="S5.p9" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">In Estonian, the main source of errors is the typing process, as evidenced by
the relatively high proportion of </span><span class="ltx_text ltx_font_bold" style="font-size:90%;">transpositions</span><span class="ltx_text" style="font-size:90%;"> (e.g.
</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">*komapnii</span><span class="ltx_text" style="font-size:90%;">—</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">kompanii</span><span class="ltx_text" style="font-size:90%;">) and repetitions (e.g.
</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">*poliititika</span><span class="ltx_text" style="font-size:90%;">—</span><span class="ltx_text ltx_font_italic" style="font-size:90%;">poliitika</span><span class="ltx_text" style="font-size:90%;">). Errors relating to incorrectly
writing phones are relatively few. In North Sámi, the main source of errors is
the phone-to-letter process, i.e. applying rules of orthography. Many
substitution errors may be blamed on it. This is also documented and discussed
by </span><cite class="ltx_cite ltx_citemacro_cite"><span class="ltx_text" style="font-size:90%;">[</span><a href="#bib.bib13" title="Cállinmeattáhusaid guorran." class="ltx_ref">1</a><span class="ltx_text" style="font-size:90%;">]</span></cite><span class="ltx_text" style="font-size:90%;">.</span></p>
</div>
<div id="S5.p10" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">In South Sámi as well, the main source of errors is the phone-to-letter process,
i.e. applying rules of orthography. In addition, another major source of error
is the morphophonology of the language, especially as related to syllable
structure and its consequences for </span><span class="ltx_text ltx_font_bold" style="font-size:90%;">suffix</span><span class="ltx_text" style="font-size:90%;"> realisation, as exemplified
by </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">*edtjibie</span><span class="ltx_text" style="font-size:90%;"> vs </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">edtjebe</span><span class="ltx_text" style="font-size:90%;">. But the biggest class of errors in
South Sámi is the unclassified </span><span class="ltx_text ltx_font_bold" style="font-size:90%;">other</span><span class="ltx_text" style="font-size:90%;"> group — these are typos that are
</span><span class="ltx_text" style="font-size:90%;">not easily classified by the means used in this work.</span></p>
</div>
<figure id="S5.T1" class="ltx_table">
<table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle">
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_tt"><span class="ltx_text ltx_font_bold" style="font-size:90%;">Main error class</span></th>
<td class="ltx_td ltx_align_center ltx_border_tt"><span class="ltx_text ltx_font_bold" style="font-size:90%;">Subclass</span></td>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_tt" colspan="2"><span class="ltx_text ltx_font_bold" style="font-size:90%;">Estonian</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_tt" colspan="2"><span class="ltx_text ltx_font_bold" style="font-size:90%;">North Sámi</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_tt" colspan="2"><span class="ltx_text ltx_font_bold" style="font-size:90%;">South Sámi</span></th>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_th ltx_th_row ltx_border_t"></th>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t">
<span class="ltx_text ltx_font_italic" style="font-size:90%;">á</span><span class="ltx_text" style="font-size:90%;"> vs </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">a</span>
</td>
<th class="ltx_td ltx_th ltx_th_row ltx_border_t"></th>
<td class="ltx_td ltx_border_r ltx_border_t"></td>
<th class="ltx_td ltx_align_right ltx_th ltx_th_row ltx_border_t"><span class="ltx_text" style="font-size:90%;">25</span></th>
<td class="ltx_td ltx_border_r ltx_border_t"></td>
<th class="ltx_td ltx_th ltx_th_row ltx_border_t"></th>
<td class="ltx_td ltx_border_t"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_th ltx_th_row"></th>
<td class="ltx_td ltx_align_center ltx_border_r">
<span class="ltx_text ltx_font_italic" style="font-size:90%;">čđŋšŧž</span><span class="ltx_text" style="font-size:90%;"> vs </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">cdnstz</span>
</td>
<th class="ltx_td ltx_th ltx_th_row"></th>
<td class="ltx_td ltx_border_r"></td>
<th class="ltx_td ltx_align_right ltx_th ltx_th_row"><span class="ltx_text" style="font-size:90%;">8</span></th>
<td class="ltx_td ltx_border_r"></td>
<th class="ltx_td ltx_th ltx_th_row"></th>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row"><span class="ltx_text" style="font-size:90%;">Only accented letter errors</span></th>
<td class="ltx_td ltx_border_r"></td>
<th class="ltx_td ltx_th ltx_th_row"></th>
<td class="ltx_td ltx_align_right ltx_border_r"><span class="ltx_text" style="font-size:90%;">2</span></td>
<th class="ltx_td ltx_align_right ltx_th ltx_th_row"><span class="ltx_text" style="font-size:90%;">=</span></th>
<td class="ltx_td ltx_align_right ltx_border_r"><span class="ltx_text" style="font-size:90%;">33</span></td>
<th class="ltx_td ltx_th ltx_th_row"></th>
<td class="ltx_td ltx_align_right"><span class="ltx_text" style="font-size:90%;">5</span></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_th ltx_th_row ltx_border_t"></th>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">double or diphthong</span></td>
<th class="ltx_td ltx_align_right ltx_th ltx_th_row ltx_border_t"><span class="ltx_text" style="font-size:90%;">7</span></th>
<td class="ltx_td ltx_border_r ltx_border_t"></td>
<th class="ltx_td ltx_align_right ltx_th ltx_th_row ltx_border_t"><span class="ltx_text" style="font-size:90%;">13</span></th>
<td class="ltx_td ltx_border_r ltx_border_t"></td>
<th class="ltx_td ltx_th ltx_th_row ltx_border_t"></th>
<td class="ltx_td ltx_border_t"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_th ltx_th_row"></th>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:90%;">other</span></td>
<th class="ltx_td ltx_align_right ltx_th ltx_th_row"><span class="ltx_text" style="font-size:90%;">37</span></th>
<td class="ltx_td ltx_border_r"></td>
<th class="ltx_td ltx_align_right ltx_th ltx_th_row"><span class="ltx_text" style="font-size:90%;">7</span></th>
<td class="ltx_td ltx_border_r"></td>
<th class="ltx_td ltx_th ltx_th_row"></th>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row"><span class="ltx_text" style="font-size:90%;">Delete 1</span></th>
<td class="ltx_td ltx_border_r"></td>
<th class="ltx_td ltx_align_right ltx_th ltx_th_row"><span class="ltx_text" style="font-size:90%;">=</span></th>
<td class="ltx_td ltx_align_right ltx_border_r"><span class="ltx_text" style="font-size:90%;">44</span></td>
<th class="ltx_td ltx_align_right ltx_th ltx_th_row"><span class="ltx_text" style="font-size:90%;">=</span></th>
<td class="ltx_td ltx_align_right ltx_border_r"><span class="ltx_text" style="font-size:90%;">20</span></td>
<th class="ltx_td ltx_th ltx_th_row"></th>
<td class="ltx_td ltx_align_right"><span class="ltx_text" style="font-size:90%;">18</span></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_th ltx_th_row ltx_border_t"></th>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">double or diphthong</span></td>
<th class="ltx_td ltx_align_right ltx_th ltx_th_row ltx_border_t"><span class="ltx_text" style="font-size:90%;">7</span></th>
<td class="ltx_td ltx_border_r ltx_border_t"></td>
<th class="ltx_td ltx_align_right ltx_th ltx_th_row ltx_border_t"><span class="ltx_text" style="font-size:90%;">11</span></th>
<td class="ltx_td ltx_border_r ltx_border_t"></td>
<th class="ltx_td ltx_th ltx_th_row ltx_border_t"></th>
<td class="ltx_td ltx_border_t"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_th ltx_th_row"></th>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:90%;">other</span></td>
<th class="ltx_td ltx_align_right ltx_th ltx_th_row"><span class="ltx_text" style="font-size:90%;">16</span></th>
<td class="ltx_td ltx_border_r"></td>
<th class="ltx_td ltx_align_right ltx_th ltx_th_row"><span class="ltx_text" style="font-size:90%;">4</span></th>
<td class="ltx_td ltx_border_r"></td>
<th class="ltx_td ltx_th ltx_th_row"></th>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row"><span class="ltx_text" style="font-size:90%;">Add 1</span></th>
<td class="ltx_td ltx_border_r"></td>
<th class="ltx_td ltx_align_right ltx_th ltx_th_row"><span class="ltx_text" style="font-size:90%;">=</span></th>
<td class="ltx_td ltx_align_right ltx_border_r"><span class="ltx_text" style="font-size:90%;">23</span></td>
<th class="ltx_td ltx_align_right ltx_th ltx_th_row"><span class="ltx_text" style="font-size:90%;">=</span></th>
<td class="ltx_td ltx_align_right ltx_border_r"><span class="ltx_text" style="font-size:90%;">15</span></td>
<th class="ltx_td ltx_th ltx_th_row"></th>
<td class="ltx_td ltx_align_right"><span class="ltx_text" style="font-size:90%;">14</span></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t"><span class="ltx_text" style="font-size:90%;">Substitute 1</span></th>
<td class="ltx_td ltx_border_r ltx_border_t"></td>
<th class="ltx_td ltx_th ltx_th_row ltx_border_t"></th>
<td class="ltx_td ltx_align_right ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">13</span></td>
<th class="ltx_td ltx_th ltx_th_row ltx_border_t"></th>
<td class="ltx_td ltx_align_right ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">17</span></td>
<th class="ltx_td ltx_th ltx_th_row ltx_border_t"></th>
<td class="ltx_td ltx_align_right ltx_border_t"><span class="ltx_text" style="font-size:90%;">11</span></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_th ltx_th_row ltx_border_t"></th>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">1 to 2 or 2 to 1</span></td>
<th class="ltx_td ltx_th ltx_th_row ltx_border_t"></th>
<td class="ltx_td ltx_border_r ltx_border_t"></td>
<th class="ltx_td ltx_align_right ltx_th ltx_th_row ltx_border_t"><span class="ltx_text" style="font-size:90%;">3</span></th>
<td class="ltx_td ltx_border_r ltx_border_t"></td>
<th class="ltx_td ltx_align_right ltx_th ltx_th_row ltx_border_t"><span class="ltx_text" style="font-size:90%;">7</span></th>
<td class="ltx_td ltx_border_t"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_th ltx_th_row"></th>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:90%;">adjacent</span></td>
<th class="ltx_td ltx_th ltx_th_row"></th>
<td class="ltx_td ltx_border_r"></td>
<th class="ltx_td ltx_align_right ltx_th ltx_th_row"><span class="ltx_text" style="font-size:90%;">2</span></th>
<td class="ltx_td ltx_border_r"></td>
<th class="ltx_td ltx_align_right ltx_th ltx_th_row"><span class="ltx_text" style="font-size:90%;">4</span></th>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row"><span class="ltx_text" style="font-size:90%;">Substitute 2</span></th>
<td class="ltx_td ltx_border_r"></td>
<th class="ltx_td ltx_th ltx_th_row"></th>
<td class="ltx_td ltx_align_right ltx_border_r"><span class="ltx_text" style="font-size:90%;">0</span></td>
<th class="ltx_td ltx_align_right ltx_th ltx_th_row"><span class="ltx_text" style="font-size:90%;">=</span></th>
<td class="ltx_td ltx_align_right ltx_border_r"><span class="ltx_text" style="font-size:90%;">5</span></td>
<th class="ltx_td ltx_align_right ltx_th ltx_th_row"><span class="ltx_text" style="font-size:90%;">=</span></th>
<td class="ltx_td ltx_align_right"><span class="ltx_text" style="font-size:90%;">11</span></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t"><span class="ltx_text" style="font-size:90%;">Transposition</span></th>
<td class="ltx_td ltx_border_r ltx_border_t"></td>
<th class="ltx_td ltx_th ltx_th_row ltx_border_t"></th>
<td class="ltx_td ltx_align_right ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">10</span></td>
<th class="ltx_td ltx_th ltx_th_row ltx_border_t"></th>
<td class="ltx_td ltx_align_right ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">2</span></td>
<th class="ltx_td ltx_th ltx_th_row ltx_border_t"></th>
<td class="ltx_td ltx_align_right ltx_border_t"><span class="ltx_text" style="font-size:90%;">2</span></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t"><span class="ltx_text" style="font-size:90%;">Repetition (South Sámi: suffix)</span></th>
<td class="ltx_td ltx_border_r ltx_border_t"></td>
<th class="ltx_td ltx_th ltx_th_row ltx_border_t"></th>
<td class="ltx_td ltx_align_right ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">2</span></td>
<th class="ltx_td ltx_th ltx_th_row ltx_border_t"></th>
<td class="ltx_td ltx_align_right ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">0</span></td>
<th class="ltx_td ltx_th ltx_th_row ltx_border_t"></th>
<td class="ltx_td ltx_align_right ltx_border_t"><span class="ltx_text" style="font-size:90%;">3</span></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t"><span class="ltx_text" style="font-size:90%;">Other</span></th>
<td class="ltx_td ltx_border_r ltx_border_t"></td>
<th class="ltx_td ltx_th ltx_th_row ltx_border_t"></th>
<td class="ltx_td ltx_align_right ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">6</span></td>
<th class="ltx_td ltx_th ltx_th_row ltx_border_t"></th>
<td class="ltx_td ltx_align_right ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">8</span></td>
<th class="ltx_td ltx_th ltx_th_row ltx_border_t"></th>
<td class="ltx_td ltx_align_right ltx_border_t"><span class="ltx_text" style="font-size:90%;">36</span></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_bb ltx_border_tt"><span class="ltx_text" style="font-size:90%;">Total</span></th>
<td class="ltx_td ltx_border_bb ltx_border_r ltx_border_tt"></td>
<th class="ltx_td ltx_th ltx_th_row ltx_border_bb ltx_border_tt"></th>
<td class="ltx_td ltx_align_right ltx_border_bb ltx_border_r ltx_border_tt"><span class="ltx_text" style="font-size:90%;">100%</span></td>
<th class="ltx_td ltx_th ltx_th_row ltx_border_bb ltx_border_tt"></th>
<td class="ltx_td ltx_align_right ltx_border_bb ltx_border_r ltx_border_tt"><span class="ltx_text" style="font-size:90%;">100%</span></td>
<th class="ltx_td ltx_th ltx_th_row ltx_border_bb ltx_border_tt"></th>
<td class="ltx_td ltx_align_right ltx_border_bb ltx_border_tt"><span class="ltx_text" style="font-size:90%;">100%</span></td>
</tr>
</tbody>
</table>
<figcaption class="ltx_caption ltx_centering" style="font-size:90%;"><span class="ltx_tag ltx_tag_table">Table 1: </span>Error types, percentage of all errors.</figcaption>
</figure>
</section>
<section id="S6" class="ltx_section">
<h2 class="ltx_title ltx_title_section" style="font-size:90%;">
<span class="ltx_tag ltx_tag_section">6 </span>Error models</h2>
<div id="S6.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">The error models we study are the baseline, a new regex model, and a
machine-learned model. The baseline model is a general edit distance 2 model built from
the alphabet of the language, with some language-specific tweaks described
</span><span class="ltx_text" style="font-size:90%;">below, whereas the regex model focuses on documented and generalisable error
types for the language in question.</span></p>
</div>
<section id="S6.SS1" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection" style="font-size:90%;">
<span class="ltx_tag ltx_tag_subsection">6.1 </span>Baseline error models for South and North Sámi</h3>
<div id="S6.SS1.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">The baseline error models for North and South Sámi are the ones used in
production</span><span id="footnote15" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">15</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">15</sup>
<span class="ltx_tag ltx_tag_note"><span class="ltx_text" style="font-size:90%;">15</span></span>
<a href="https://divvun.no" title="" class="ltx_ref ltx_url ltx_font_typewriter" style="font-size:90%;">https://divvun.no</a></span></span></span><span class="ltx_text" style="font-size:90%;">. They are both built following the
same structure, and as such the models will be described only once. A general
description of the production error model can be found
online</span><span id="footnote16" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">16</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">16</sup>
<span class="ltx_tag ltx_tag_note"><span class="ltx_text" style="font-size:90%;">16</span></span>
<a href="https://giellalt.uit.no/proof/TheSpellerErrorModel.html" title="" class="ltx_ref ltx_url ltx_font_typewriter" style="font-size:90%;">https://giellalt.uit.no/proof/TheSpellerErrorModel.html</a></span></span></span><span class="ltx_text" style="font-size:90%;">.</span></p>
</div>
<div id="S6.SS1.p2" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">The starting point is a Levenshtein edit distance </span><cite class="ltx_cite ltx_citemacro_cite"><span class="ltx_text" style="font-size:90%;">[</span><a href="#bib.bib146" title="Binary codes capable of correcting deletions, insertions, and reversals" class="ltx_ref">18</a><span class="ltx_text" style="font-size:90%;">]</span></cite><span class="ltx_text" style="font-size:90%;">
error model based on the alphabet of the language, with an edit distance of
two. It is possible to adjust the weight of specific edits in the edit distance
2 error model. Adjacent swaps are not enabled by default (they are
computationally quite expensive in the present implementation).</span></p>
</div>
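<div class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">As a rough illustration of such an edit distance 2 model, the sketch below enumerates candidates with plain Python sets rather than a finite-state transducer. The function names, the tiny lexicon and the alphabet derived from it are assumptions made for the example; they are not taken from the production models. Edit weights are not modelled, and adjacent swaps are off unless requested, as in the default described above.</span></p>
</div>

```python
from typing import Dict, Set


def single_edits(word: str, alphabet: str, allow_swaps: bool = False) -> Set[str]:
    """Return every string that is exactly one edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    inserts = {left + ch + right for left, right in splits for ch in alphabet}
    substitutions = {
        left + ch + right[1:] for left, right in splits if right for ch in alphabet
    }
    edits = deletes | inserts | substitutions
    if allow_swaps:
        # Adjacent transposition: only possible where two letters remain.
        edits |= {
            left + right[1] + right[0] + right[2:]
            for left, right in splits
            if right[1:]
        }
    edits.discard(word)
    return edits


def candidates(
    word: str,
    alphabet: str,
    lexicon: Set[str],
    max_distance: int = 2,
    allow_swaps: bool = False,
) -> Dict[str, int]:
    """Map each lexicon word reachable within `max_distance` edits to its distance."""
    found: Dict[str, int] = {}
    frontier = {word}
    seen = {word}
    for distance in range(1, max_distance + 1):
        next_frontier: Set[str] = set()
        for form in frontier:
            next_frontier |= single_edits(form, alphabet, allow_swaps)
        next_frontier -= seen  # keep only forms first reached at this distance
        seen |= next_frontier
        for form in next_frontier:
            if form in lexicon:
                found[form] = distance
        frontier = next_frontier
    return found


# Hypothetical toy lexicon built from correct forms quoted in the text.
lexicon = {"gažaldaga", "departemeanta", "kompanii"}
alphabet = "".join(sorted(set("".join(lexicon))))

print(candidates("gazaldaga", alphabet, lexicon))
print(candidates("komapnii", alphabet, lexicon))
print(candidates("komapnii", alphabet, lexicon, allow_swaps=True))
```

In this sketch the transposition in komapnii costs two substitutions when swaps are disabled, and a single edit when they are enabled.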
<div id="S6.SS1.p3" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">Parallel to the default Levenshtein error model, there is a separate set of
string edits, handwritten based on identified and frequent error patterns in the
languages. The string edits are single FST operations, although each string can
be arbitrarily long, thus allowing for much more complex edits than the default
model. The string edits are applied as many times as the default error model,
that is, up to twice for both North and South Sámi.</span></p>
</div>
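<div class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">The idea of multi-letter string edits applied up to twice can be sketched as follows. This is a set-based stand-in for the FST, and the two rules are assumptions derived from the North Sámi examples in Section 5 (direktora and Sámedikkeválgii); the actual handwritten rule set is not reproduced here.</span></p>
</div>

```python
from typing import Dict, List, Set, Tuple


def apply_once(word: str, rules: List[Tuple[str, str]]) -> Set[str]:
    """Rewrite exactly one occurrence of one rule's left-hand side."""
    results: Set[str] = set()
    for wrong, right in rules:
        start = word.find(wrong)
        while start != -1:
            results.add(word[:start] + right + word[start + len(wrong):])
            start = word.find(wrong, start + 1)
    return results


def string_edit_candidates(
    word: str,
    rules: List[Tuple[str, str]],
    lexicon: Set[str],
    max_applications: int = 2,
) -> Dict[str, int]:
    """Map lexicon words to the number of string edits needed to reach them."""
    found: Dict[str, int] = {}
    frontier = {word}
    for count in range(1, max_applications + 1):
        frontier = {new for form in frontier for new in apply_once(form, rules)}
        for form in frontier:
            if form in lexicon and form not in found:
                found[form] = count
    return found


# Illustrative rules (assumed, not the production list): misspelled -> correct.
rules = [("or", "evr"), ("kk", "gg")]
lexicon = {"direktevra", "Sámediggeválgii"}

print(string_edit_candidates("direktora", rules, lexicon))
print(string_edit_candidates("Sámedikkeválgii", rules, lexicon))
```

Each rule counts as one operation however many letters it rewrites, which is what lets such edits reach corrections that lie outside plain edit distance 2.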
<div id="S6.SS1.p4" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">Another extension to the default model is a set of suffix edits, that is, a simple
</span><span class="ltx_text" style="font-size:90%;">transducer mapping input strings to output strings, like the string edits
described above, but restricted to the end of the word. As noted earlier,
errors in suffixes are relatively common, especially in South Sámi, and this
module is meant to target such errors.</span></p>
</div>
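<div class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">A suffix edit differs from a general string edit only in being anchored to the end of the word. The minimal sketch below assumes a single illustrative rule, inferred from the South Sámi pair edtjibie and edtjebe mentioned in Section 5; the rule and the function name are hypothetical.</span></p>
</div>

```python
from typing import List, Set, Tuple


def suffix_edit_candidates(
    word: str, suffix_rules: List[Tuple[str, str]], lexicon: Set[str]
) -> Set[str]:
    """Apply each rule only when its left-hand side is the end of the word."""
    found: Set[str] = set()
    for wrong, right in suffix_rules:
        if word.endswith(wrong):
            candidate = word[: len(word) - len(wrong)] + right
            if candidate in lexicon:
                found.add(candidate)
    return found


# Illustrative rule (assumed): misspelled suffix -> correct suffix.
suffix_rules = [("ibie", "ebe")]
lexicon = {"edtjebe"}

print(suffix_edit_candidates("edtjibie", suffix_rules, lexicon))
```

Because the rule is anchored at the word end, the same letter sequence elsewhere in a word is left untouched.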
<div id="S6.SS1.p5" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">Finally, there is a whole-word string replacement module, but that one is
utilized very rarely, and does not impact the performance very much. It is also
applied to the new regex models described below, mainly because it would be more
work to avoid using it.</span></p>
</div>
<div id="S6.SS1.p6" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">For Estonian, the regex model is the first one implemented in FST. It is based
on the earlier work by Filosoft; no earlier baseline models have been developed
for Estonian.</span></p>
</div>
</section>
<section id="S6.SS2" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection" style="font-size:90%;">
<span class="ltx_tag ltx_tag_subsection">6.2 </span>Rule-based error models</h3>
<div id="S6.SS2.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">The </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">regular expressions</span><span class="ltx_text" style="font-size:90%;"> (regexes) are grouped according to our
assumptions about the nature and likelihood of different types of spelling
errors. Although guided by the principle that ranking should
prefer suggestions with fewer modifications, our ranking is not based directly on
Levenshtein distance. The reasoning is that when calculating the amount of
difference between two words, one should view them not as mere symbol strings,
but as the traces of a series of mental and physical actions. A change in one
action may result in multiple changes in the letter sequence, but it should
still be counted as one error.
</span></p>
</div>
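<div class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">The difference between counting symbol changes and counting actions can be made concrete with a small Python sketch. Both function names are hypothetical, and the example pair in the usage note is chosen only for illustration: writing sh for š changes two symbols but reflects a single decision by the writer.</span></p>
</div>

```python
def levenshtein(a, b):
    """Plain Levenshtein distance with unit costs and no swaps."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def rule_applications(misspelling, correction, wrong, right):
    """Count uses of one rewrite rule if it alone explains the misspelling.

    Returns None when rewriting every `wrong` as `right` does not yield
    the correction.
    """
    if misspelling.replace(wrong, right) == correction:
        return misspelling.count(wrong)
    return None
```

<div class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">For the pair shokk and šokk, the Levenshtein distance is 2 (one substitution and one deletion), whereas a single application of the rule sh to š accounts for the whole difference.</span></p>
</div>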
<div id="S6.SS2.p2" class="ltx_para">
<ul id="S6.I1" class="ltx_itemize">
<li id="S6.I1.i1" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_item"><span class="ltx_text" style="font-size:90%;">•</span></span>
<div id="S6.I1.i1.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">Keyboard and orthography (mis)matches. In addition to the Latin letters
that form the core of the alphabet, languages typically need some (usually
accented) modifications of some of these letters, corresponding to the
phones not covered by the core alphabet. These accented letters tend to
be positioned on the periphery of the standard keyboard, and/or require
key combinations to type. It is to be
expected that such letters also tend to be mistyped. Also, an accent on
a letter may indicate a minor pronunciation subtlety which the speakers
need not pay much attention to, so confusing similar-looking and similar-sounding
letters would be easy.</span></p>
</div>
<div id="S6.I1.i1.p2" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:90%;">For Estonian, the misspelling list indicates that when the keyboard does not
provide a convenient way to type the accented letters, users may come up
with an alternative orthography, e.g. using </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">sh</span><span class="ltx_text" style="font-size:90%;"> or </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">s^</span><span class="ltx_text" style="font-size:90%;">
instead of the correct </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">š</span><span class="ltx_text" style="font-size:90%;">. If this is the case, then one may
expect any number of substitutions of this kind in a wordform (in addition
to other errors). Nordic letters that are not part of the Sámi
alphabets, and </span><span class="ltx_text ltx_font_italic" style="font-size:90%;">á</span><span class="ltx_text" style="font-size:90%;"> which is notoriously difficult for North Sámi
writers to use correctly, also belong to this class of errors.
Correcting them is weighted lightly, and the number of such edit
operations is not limited.</span></p>