<!DOCTYPE html><html>
<head>
<title>State-of-the-Art in Weighted Finite-State Spell-Checking</title>
<!--Generated on Fri Sep 29 14:26:40 2017 by LaTeXML (version 0.8.2) http://dlmf.nist.gov/LaTeXML/.-->
<!--Document created on Last Modification: September 29, 2017.-->
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link rel="stylesheet" href="../latexml/LaTeXML.css" type="text/css">
<link rel="stylesheet" href="../latexml/ltx-article.css" type="text/css">
</head>
<body>
<div class="ltx_page_main">
<div class="ltx_page_content">
<article class="ltx_document ltx_authors_1line">
<h1 class="ltx_title ltx_title_document">State-of-the-Art in Weighted Finite-State Spell-Checking</h1>
<div class="ltx_authors">
<span class="ltx_creator ltx_role_author">
<span class="ltx_personname">Tommi A Pirinen and Krister Lindén
<br class="ltx_break">University of Helsinki
<br class="ltx_break">Department of Modern Languages
<br class="ltx_break"><span class="ltx_ERROR undefined">\url</span>[email protected] and <span class="ltx_ERROR undefined">\url</span>[email protected]
</span></span>
</div>
<div class="ltx_date ltx_role_creation">Last Modification: September 29, 2017</div>
<div class="ltx_abstract">
<h6 class="ltx_title ltx_title_abstract">Abstract</h6>
<p class="ltx_p">The following claims can be made about finite-state methods for spell-checking:
1) Finite-state language models provide support for morphologically complex
languages that word lists, affix stripping and similar approaches do not
provide; 2) Weighted finite-state models have expressive power equal to other,
state-of-the-art string algorithms used by contemporary spell-checkers; and 3)
Finite-state models are at least as fast as other string algorithms for lookup
and error correction. In this article, we use some contemporary
non-finite-state spell-checking methods as a baseline and perform tests in
light of the claims, to evaluate state-of-the-art finite-state spell-checking
methods. We verify that finite-state spell-checking systems outperform the
traditional approaches for English. We also show that the models for
morphologically complex languages can be made to perform on par with English
systems.</p>
<p class="ltx_p"><span class="ltx_text ltx_font_bold" style="font-size:90%;">Keywords:<span class="ltx_text ltx_font_medium"> spell-checking, weighted finite-state technology, error models</span></span></p>
</div>
<section id="S1" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">1 </span>Introduction</h2>
<span class="ltx_ERROR undefined">\footnotepubrights</span>
<div id="S1.p1" class="ltx_para">
<p class="ltx_p">This version is fully reformatted from original which
was made during serious time constraints due to my graduation.
Spell-checking and correction is a traditional and well-researched part of
computational linguistics. Finite-state methods for language models are
widely recognized as a good way to handle languages which are morphologically
more complex <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib66" title="Finite state morphology" class="ltx_ref">2</a>]</cite>. In this article, we evaluate weighted, fully finite-state
spell-checking systems for morphologically complex languages. We use existing
finite-state models and algorithms and describe some necessary additions to
bridge the gaps and surpass state-of-the-art in non-finite-state
spell-checking. For the set of languages, we have chosen to study North Sámi
and Finnish from the complex, agglutinative group of languages, Greenlandic
from the complex poly-agglutinative group, and English to confirm that our
finite-state formulations of traditional spelling correction applications are
working as described in the literature.
</p>
</div>
<div id="S1.p2" class="ltx_para">
<p class="ltx_p">As contemporary spell-checkers are increasingly using statistical approaches
for the task, weighted finite-state models provide the equivalent expressive
power, even for the morphologically more complex languages, by encoding the
probabilities as weights in the automata. As the programmatic noisy channel
models <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib67" title="An improved error model for noisy channel spelling correction" class="ltx_ref">6</a>]</cite> can encode the error probabilities when making
the corrections, so can the weighted finite-state automata encode these
probabilities.</p>
</div>
<div id="S1.p3" class="ltx_para">
<p class="ltx_p">The task of spell-checking is split into two parts, error detection and error
correction. Error detection by language model lookup is referred to as
non-word or isolated error detection. The task of detecting isolated errors is
often considered trivial or solved in many research papers dealing with
spelling correction, e.g. <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib26" title="Contextual spelling correction" class="ltx_ref">20</a>]</cite>. More complex error
detection systems may be used to detect words that are correctly spelled, but
are unsuitable in the syntactic or semantic context. This is referred to as
real-word error detection in context <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib68" title="Context based spelling correction" class="ltx_ref">15</a>]</cite>.</p>
</div>
<div id="S1.p4" class="ltx_para">
<p class="ltx_p">The task of error-correction is to generate the most likely correct word-forms
given a misspelled word-form. This can also be split in two different tasks:
generating suggestions and ranking them. Generating corrections is often
referred to as error modeling. The main point of error modeling is to correct
spelling errors accurately by observing the causes of errors and making
predictive models of them <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib22" title="Correcting spelling errors by modelling their causes" class="ltx_ref">10</a>]</cite>. This effectively
splits the error models into numerous sub-categories, each applicable to
correcting specific types of spelling errors. The most used model accounts for
typos, i.e. the slip of a finger on a keyboard. This model is nearly language
agnostic, although it can be tuned to each local keyboard layout. The other set
of errors is more language and user-specific; it stems from the lack of
knowledge or language competence, e.g., in non-phonemic orthographies, such as
English, learners and unskilled writers commonly make mistakes such as
writing their instead of there, as they are pronounced alike; similarly
competence errors will give rise to common confusable words in other languages,
such as missing an accent, writing a digraph instead of its unigraph variant,
or confusing one morph with another.</p>
</div>
<div id="S1.p5" class="ltx_para">
<p class="ltx_p">A common source of the probabilities for ranking suggestions related to
competence errors are the neighboring words and word-forms captured in a
language model. For morphologically complex languages, part-of-speech
information is needed <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib26" title="Contextual spelling correction" class="ltx_ref">20</a>, <a href="#bib.bib29" title="Improving finite-state spell-checker suggestions with part of speech n-grams" class="ltx_ref">24</a>]</cite>, which
can be compared with the studies on isolating
languages <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib68" title="Context based spelling correction" class="ltx_ref">15</a>, <a href="#bib.bib59" title="Real-word spelling correction with trigrams: a reconsideration of the Mays, Damerau, and Mercer model" class="ltx_ref">30</a>]</cite>. Context-based models
like these are, however, considered to be out of scope for spell-checking,
rather being part of grammar-checking.</p>
</div>
<div id="S1.p6" class="ltx_para">
<p class="ltx_p">Advanced language model training schemes, such as the use of morphological
analyses as error detection evidence <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib68" title="Context based spelling correction" class="ltx_ref">15</a>]</cite>, require large
manually verified morphologically analyzed and disambiguated corpora, which do
not exist as open, freely usable resources, if at all. In addition, for
polysynthetic languages like Greenlandic, even a gigaword corpus is usually not
nearly as complete as an English corpus with a million word-forms.</p>
</div>
<div id="S1.p7" class="ltx_para">
<p class="ltx_p">As we compare existing finite-state technologies with contemporary
non-finite-state string algorithm solutions, we use
Hunspell<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">1</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">1</sup><span class="ltx_ERROR undefined">\url</span>http://hunspell.sf.net</span></span></span> as setting the current de facto
standard in open-source spell-checking and the baseline for the quality to
achieve. Taken together, this paper demonstrates for the first time that using
weighted finite-state technology, spell-checking for morphologically complex
languages can be made to perform on par with English systems and surpass the
current de facto standard.</p>
</div>
<div id="S1.p8" class="ltx_para">
<p class="ltx_p">This article is structured as follows: In Subsection 1.1, we briefly describe
the history of spell-checking up to the finite-state formulation of the
problem. In Subsection 1.2, we revisit the notations behind the statistics we
apply to our language and error models. In Section 2, we present existing
methods for creating finite-state language and error models for spell-checkers.
In Section 3, we present the actual data, the language models, the error
models and the corpora we have used, and in Section 4, we show how different
languages and error models affect the accuracy, precision, and speed of
finite-state spell-checking. In Section 5, we discuss the results, and finally,
in Section 6, we conclude our findings.</p>
</div>
<section id="S1.SS1" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">1.1 </span>A Brief History of Automatic Spell-Checking and Correction</h3>
<div id="S1.SS1.p1" class="ltx_para">
<p class="ltx_p">Automatic spelling correction by computer is in itself, an old invention, with
the ini- tial work done as early as in the 1960âs. Beginning with the invention
of the generic error model for typing mistakes, the Levenshtein-Damerau
distance <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib53" title="A technique for computer detection and correction of spelling errors" class="ltx_ref">8</a>, <a href="#bib.bib55" title="ÐвоиÑнÑе ÐºÐ¾Ð´Ñ Ñ Ð¸ÑпÑавлением вÑпадений, вÑÑавок и замеÑений Ñимволов" class="ltx_ref">9</a>]</cite> and the first
applications of the noisy channel model <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib28" title="A mathematical theory of communications, i and ii" class="ltx_ref">29</a>]</cite> to
spell-checking <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib27" title="Decision making in Markov chains applied to the problem of pattern recognition" class="ltx_ref">26</a>]</cite>, the early solutions treated the
dictionaries as simple word lists, or later, word-lists with up to a few
affixes with simple stem mutations and finally some basic compounding
processes. The most recent and widespread implementation with a word-list, stem
mutations, affixes and some compounding is Hunspell, which is in common use
in the open-source world of spell-checking and correction and must be regarded
as the reference implementation. The word-list approach, even with some affix
stripping and stem mutations, has sometimes been found insufficient for
morphologically complex languages. E.g. a recent attempt to utilize Hunspell
for Finnish was unsuccessful <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib24" title="Hunspell-in kesäkoodi 2006: final report" class="ltx_ref">25</a>]</cite>. In part, the
popularity of the finite-state methods in computational linguistics seen in the
1980s was driven by a need for the morphologically more complex languages to
get language models and morphological analyzers with recurring derivation and
compounding processes <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib23" title="Morphological analysis and generation: a first step in natural language processing" class="ltx_ref">3</a>]</cite>. They also provide an
opportunity to use arbitrary finite-state automata as language models without
modifying the runtime code, e.g. <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib89" title="Finite-state spell-checking with weighted language and error models" class="ltx_ref">23</a>]</cite>.</p>
</div>
<div id="S1.SS1.p2" class="ltx_para">
<p class="ltx_p">Given the finite-state representation of the dictionaries and the expressive
power of the finite-state systems, the concept of a finite-state based
implementation for spelling correction was an obvious development. The earliest
approaches presented an algorithmic way to implement the finite-state network
traversal with error-tolerance <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib43" title="Error-tolerant finite-state recognition with applications to morphological analysis and spelling correction" class="ltx_ref">19</a>]</cite> in a fast and
effective manner <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib50" title="Fast approximate string matching with finite automata" class="ltx_ref">12</a>, <a href="#bib.bib48" title="Typographical nearest-neighbor search in a finite-state lexicon and its application to spelling correction" class="ltx_ref">27</a>]</cite>. Schulz
and Mihov <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib51" title="Fast string correction with levenshtein-automata" class="ltx_ref">28</a>]</cite> presented the Levenshtein-Damerau distance in a
finite-state form such that the finite-state spelling correction could be
performed using standard finite-state algebraic operations with any existing
finite-state library. Furthermore, e.g., Pirinen and
Lindén <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib6" title="Creating and weighting Hunspell dictionaries as finite-state automata" class="ltx_ref">22</a>]</cite> have shown that the weighted finite-state
methods can be used to gain the same expressive power as the existing
statistical spellchecking software algorithms.</p>
</div>
</section>
<section id="S1.SS2" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">1.2 </span>Notations and some Statistics for Language and Error Models</h3>
<div id="S1.SS2.p1" class="ltx_para">
<p class="ltx_p">In this article, where the formulas of finite-state algebra are concerned, we
assume the standard notations from Aho et al. <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib18" title="Compilers: principles, techniques, and tools" class="ltx_ref">1</a>]</cite>: a
finite-state automaton <math id="S1.SS2.p1.m1" class="ltx_Math" alttext="\mathcal{M}" display="inline"><mi class="ltx_font_mathcaligraphic">ℳ</mi></math> is a system <math id="S1.SS2.p1.m2" class="ltx_Math" alttext="(Q,\Sigma,\delta,Q_{s},Q_{f},W)" display="inline"><mrow><mo stretchy="false">(</mo><mi>Q</mi><mo>,</mo><mi mathvariant="normal">Σ</mi><mo>,</mo><mi>δ</mi><mo>,</mo><msub><mi>Q</mi><mi>s</mi></msub><mo>,</mo><msub><mi>Q</mi><mi>f</mi></msub><mo>,</mo><mi>W</mi><mo stretchy="false">)</mo></mrow></math>, where <math id="S1.SS2.p1.m3" class="ltx_Math" alttext="Q" display="inline"><mi>Q</mi></math> is the set of states, <math id="S1.SS2.p1.m4" class="ltx_Math" alttext="\Sigma" display="inline"><mi mathvariant="normal">Σ</mi></math> the alphabet, <math id="S1.SS2.p1.m5" class="ltx_Math" alttext="\delta" display="inline"><mi>δ</mi></math> the
transition mapping of form <math id="S1.SS2.p1.m6" class="ltx_Math" alttext="Q\times\Sigma\rightarrow Q" display="inline"><mrow><mrow><mi>Q</mi><mo>×</mo><mi mathvariant="normal">Σ</mi></mrow><mo>→</mo><mi>Q</mi></mrow></math>, and <math id="S1.SS2.p1.m7" class="ltx_Math" alttext="Q_{s}" display="inline"><msub><mi>Q</mi><mi>s</mi></msub></math> and <math id="S1.SS2.p1.m8" class="ltx_Math" alttext="Q_{f}" display="inline"><msub><mi>Q</mi><mi>f</mi></msub></math>
the initial and final states of the automaton, respectively. For weighted
automata, we extend the definition in the same way as
Mohri <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib19" title="Weighted automata algorithms" class="ltx_ref">16</a>]</cite> such that <math id="S1.SS2.p1.m9" class="ltx_Math" alttext="\delta" display="inline"><mi>δ</mi></math> is extended to the transition
mapping <math id="S1.SS2.p1.m10" class="ltx_Math" alttext="Q×\Sigma ×W\rightarrow Q" display="inline"><mrow><mrow><mi>Q</mi><mo></mo><mi>Ã</mi><mo></mo><mi mathvariant="normal"></mi><mo></mo><mi mathvariant="normal">Σ</mi><mo></mo><mi>Ã</mi><mo></mo><mi mathvariant="normal"></mi><mo></mo><mi>W</mi></mrow><mo>→</mo><mi>Q</mi></mrow></math>, where <math id="S1.SS2.p1.m11" class="ltx_Math" alttext="W" display="inline"><mi>W</mi></math> is the weight, and the system
additionally includes a final weight mapping <math id="S1.SS2.p1.m12" class="ltx_Math" alttext="\rho:Q_{f}\rightarrow W" display="inline"><mrow><mi>ρ</mi><mo>:</mo><mrow><msub><mi>Q</mi><mi>f</mi></msub><mo>→</mo><mi>W</mi></mrow></mrow></math>. The
structure we use for weights is systematically the tropical semiring
<math id="S1.SS2.p1.m13" class="ltx_Math" alttext="(\mathbb{R}_{+}\cup\infty,\min,+\infty,0)" display="inline"><mrow><mo stretchy="false">(</mo><mrow><msub><mi>ℝ</mi><mo>+</mo></msub><mo>∪</mo><mi mathvariant="normal">∞</mi></mrow><mo>,</mo><mi>min</mi><mo>,</mo><mrow><mo>+</mo><mi mathvariant="normal">∞</mi></mrow><mo>,</mo><mn>0</mn><mo stretchy="false">)</mo></mrow></math>, i.e. weights are positive real
numbers that are collected by addition. The tropical semiring models penalty
weighting.</p>
</div>
<div id="S1.SS2.p2" class="ltx_para">
<p class="ltx_p">For the finite-state spell-checking, we use the following common notations:
<math id="S1.SS2.p2.m1" class="ltx_Math" alttext="\mathcal{M}_{\mathrm{D}}" display="inline"><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi mathvariant="normal">D</mi></msub></math> is
a single tape weighted finite-state automaton used for detecting the spelling errors, <math id="S1.SS2.p2.m2" class="ltx_Math" alttext="\mathcal{M}_{\mathrm{S}}" display="inline"><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi mathvariant="normal">S</mi></msub></math>
is a single tape weighted finite-state automaton used as a language model when
suggesting correct words, where the weight is used for ranking the suggestions. On
many occasions, we consider the possibility that
<math id="S1.SS2.p2.m3" class="ltx_Math" alttext="\mathcal{M}_{\mathrm{D}}=\mathcal{M}_{\mathrm{S}}" display="inline"><mrow><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi mathvariant="normal">D</mi></msub><mo>=</mo><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi mathvariant="normal">S</mi></msub></mrow></math>. The error models are
weighted two-tape automata commonly marked as <math id="S1.SS2.p2.m4" class="ltx_Math" alttext="\mathcal{M}_{\mathrm{E}}" display="inline"><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi mathvariant="normal">E</mi></msub></math>
. A word automaton is generally marked as <math id="S1.SS2.p2.m5" class="ltx_Math" alttext="\mathcal{M}_{\mathrm{word}}" display="inline"><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi>word</mi></msub></math>
. A misspelling is detected by composing the word automaton
with the detection automaton:</p>
</div>
<div id="S1.SS2.p3" class="ltx_para">
<table id="S1.E1" class="ltx_equation ltx_eqn_table">
<tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_eqn_cell ltx_align_center"><math id="S1.E1.m1" class="ltx_Math" alttext="\mathcal{M}_{\mathrm{word}}\circ\mathcal{M}_{\mathrm{D}}," display="block"><mrow><mrow><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi>word</mi></msub><mo>∘</mo><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi mathvariant="normal">D</mi></msub></mrow><mo>,</mo></mrow></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td rowspan="1" class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right"><span class="ltx_tag ltx_tag_equation ltx_align_right">(1)</span></td>
</tr>
</table>
</div>
<div id="S1.SS2.p4" class="ltx_para">
<p class="ltx_p">which results in an empty automaton on a misspelling and a non-empty automaton
on a correct spelling. The weight of the result may represent the likelihood or
the correctness of the word-form. Corrections for misspelled words can be
obtained by composing a misspelled word, an error model and a model of
correct words:</p>
</div>
<div id="S1.SS2.p5" class="ltx_para">
<table id="S1.E2" class="ltx_equation ltx_eqn_table">
<tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_eqn_cell ltx_align_center"><math id="S1.E2.m1" class="ltx_Math" alttext="\mathcal{M}_{\mathrm{word}}\circ\mathcal{M}_{\mathrm{E}}\circ\mathcal{M}_{%
\mathrm{S}}," display="block"><mrow><mrow><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi>word</mi></msub><mo>∘</mo><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi mathvariant="normal">E</mi></msub><mo>∘</mo><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi mathvariant="normal">S</mi></msub></mrow><mo>,</mo></mrow></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td rowspan="1" class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right"><span class="ltx_tag ltx_tag_equation ltx_align_right">(2)</span></td>
</tr>
</table>
<p class="ltx_p">which results in a two-tape automaton consisting of the misspelled word-form
mapped to the spelling corrections described by the error model
<math id="S1.SS2.p5.m1" class="ltx_Math" alttext="\mathcal{M}_{\mathrm{E}}" display="inline"><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi mathvariant="normal">E</mi></msub></math>
and approved by the suggestion language model <math id="S1.SS2.p5.m2" class="ltx_Math" alttext="\mathcal{M}_{\mathrm{S}}" display="inline"><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi mathvariant="normal">S</mi></msub></math>
. Both models may be weighted and the weight is
collected by standard operations as defined by the effective semiring.</p>
</div>
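<div class="ltx_para">
<p class="ltx_p">To make Formulas 1 and 2 concrete, the following is a minimal Python sketch in which plain sets and dictionaries stand in for the weighted automata; all names and weights are illustrative and this is not the HFST implementation used in the experiments. Detection is a membership test against the detection model, and correction applies an error relation and then filters and ranks the candidates with the suggestion model, combining weights by addition as in the tropical semiring.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
# Illustrative stand-ins for the automata; a real system composes weighted FSTs.
M_D = {"cat", "cast", "chat", "coat"}                        # detection lexicon
M_S = {"cat": 1.2, "cast": 2.3, "chat": 3.1, "coat": 2.9}    # suggestion lexicon, -log P weights

def detect(word, lexicon):
    """Formula 1: the composition with the detection model is non-empty
    exactly when the word-form is in the language."""
    return word in lexicon

def error_model(word):
    """A toy stand-in for the error model: single-character substitutions, weight 1."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    for i in range(len(word)):
        for c in alphabet:
            yield word[:i] + c + word[i + 1:], 1.0

def correct(word, lexicon_weights):
    """Formula 2: apply the error model, filter with the suggestion model,
    and combine weights by addition (the tropical multiplication)."""
    suggestions = {}
    for candidate, error_weight in error_model(word):
        if candidate in lexicon_weights:
            total = error_weight + lexicon_weights[candidate]
            suggestions[candidate] = min(suggestions.get(candidate, float("inf")), total)
    return sorted(suggestions.items(), key=lambda item: item[1])

print(detect("cst", M_D))    # False, so "cst" is flagged as a misspelling
print(correct("cst", M_S))   # [('cat', 2.2)]
</pre>
</div>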
<div id="S1.SS2.p6" class="ltx_para">
<p class="ltx_p">Where probabilities are used, the basic formula to estimate probabilities from
discrete frequencies of events (word-forms, mistyping events, etc.) is as
follows: <math id="S1.SS2.p6.m1" class="ltx_Math" alttext="P(x)=\frac{c(x)}{\mathrm{corpussize}}" display="inline"><mrow><mrow><mi>P</mi><mo></mo><mrow><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo></mrow></mrow><mo>=</mo><mfrac><mrow><mi>c</mi><mo></mo><mrow><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo></mrow></mrow><mi>corpussize</mi></mfrac></mrow></math> , <math id="S1.SS2.p6.m2" class="ltx_Math" alttext="x" display="inline"><mi>x</mi></math> where is the event,
<math id="S1.SS2.p6.m3" class="ltx_Math" alttext="c(x)" display="inline"><mrow><mi>c</mi><mo></mo><mrow><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo></mrow></mrow></math> is the count or frequency of the event, and <math id="S1.SS2.p6.m4" class="ltx_Math" alttext="\mathrm{corpussize}" display="inline"><mi>corpussize</mi></math> is
the sum of all event counts in the training corpus. The encoding of
probability as tropical weights in a finite-state automaton is done by setting
the end weight of the path, <math id="S1.SS2.p6.m5" class="ltx_Math" alttext="Q_{\pi_{x}}=-\log P(x)" display="inline"><mrow><msub><mi>Q</mi><msub><mi>π</mi><mi>x</mi></msub></msub><mo>=</mo><mrow><mo>-</mo><mrow><mrow><mi>log</mi><mo></mo><mi>P</mi></mrow><mo></mo><mrow><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo></mrow></mrow></mrow></mrow></math>, though in practice the
<math id="S1.SS2.p6.m6" class="ltx_Math" alttext="-\log P(x)" display="inline"><mrow><mo>-</mo><mrow><mrow><mi>log</mi><mo></mo><mi>P</mi></mrow><mo></mo><mrow><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo></mrow></mrow></mrow></math> weight may be distributed along the path depending on the
specific implementation. As events not appearing in corpora should have a
larger probability than zero, we use additive smoothing <math id="S1.SS2.p6.m7" class="ltx_Math" alttext="P(\hat{x})=\frac{c(x)+\alpha}{\mathrm{corpussize}+\mathrm{dictionarysize}%
\times\alpha}" display="inline"><mrow><mrow><mi>P</mi><mo></mo><mrow><mo stretchy="false">(</mo><mover accent="true"><mi>x</mi><mo stretchy="false">^</mo></mover><mo stretchy="false">)</mo></mrow></mrow><mo>=</mo><mfrac><mrow><mrow><mi>c</mi><mo></mo><mrow><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo></mrow></mrow><mo>+</mo><mi>α</mi></mrow><mrow><mi>corpussize</mi><mo>+</mo><mrow><mi>dictionarysize</mi><mo>×</mo><mi>α</mi></mrow></mrow></mfrac></mrow></math>, so
for an unknown event, the probability will be counted as if it had <math id="S1.SS2.p6.m8" class="ltx_Math" alttext="\alpha" display="inline"><mi>α</mi></math>
appearances. Another approach would be to set <math id="S1.SS2.p6.m9" class="ltx_Math" alttext="P(\hat{x})<\frac{1}{\mathrm{corpussize}}" display="inline"><mrow><mrow><mi>P</mi><mo></mo><mrow><mo stretchy="false">(</mo><mover accent="true"><mi>x</mi><mo stretchy="false">^</mo></mover><mo stretchy="false">)</mo></mrow></mrow><mo><</mo><mfrac><mn>1</mn><mi>corpussize</mi></mfrac></mrow></math>, which makes the probability distribution leak
but may work under some conditions <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib20" title="Large language models in machine translation" class="ltx_ref">5</a>]</cite>.
</p>
</div>
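<div class="ltx_para">
<p class="ltx_p">The following is a small Python sketch of the weighting just described: corpus counts are turned into tropical weights by additive smoothing and a negative logarithm. The counts, the dictionary size and the smoothing constant below are made up for the example.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
import math

def tropical_weights(counts, dictionary_size, alpha=1.0):
    """Turn event counts into -log P weights with additive smoothing:
    P(x) = (c(x) + alpha) / (corpussize + dictionarysize * alpha)."""
    corpus_size = sum(counts.values())
    denominator = corpus_size + dictionary_size * alpha
    weights = {event: -math.log((count + alpha) / denominator)
               for event, count in counts.items()}
    unseen = -math.log(alpha / denominator)   # weight of any event not in the corpus
    return weights, unseen

# Toy counts; in this article the events are word-forms drawn from Wikipedia.
counts = {"talo": 50, "talossa": 10, "taloissa": 2}
weights, unseen = tropical_weights(counts, dictionary_size=100000)
print(weights["talo"], unseen)   # rarer events receive larger (worse) weights
</pre>
</div>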
</section>
<section id="S1.SS3" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">1.3 </span>Morphologically Complex Resource-Poor Languages</h3>
<div id="S1.SS3.p1" class="ltx_para">
<p class="ltx_p">One of the main reasons for going fully finite-state instead of relying on
word-form lists and affix stripping is the claim that morphologically complex
languages simply cannot be handled with sufficient coverage and quality using
traditional methods. While Hunspell has virtually 100 % domination of the
open-source spell-checking field, authors of language models for
morphologically complex languages such as Turkish (cf.
Zemberek<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">2</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">2</sup>http://code.google.com/p/zemberek</span></span></span> ) and Finnish (cf.
Voikko<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">3</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">3</sup>http://voikko.sf.net</span></span></span> ) have still opted to write separate
software, even though it makes the usage of their spell-checkers troublesome
and the coverage of supported applications much smaller.</p>
</div>
<div id="S1.SS3.p2" class="ltx_para">
<p class="ltx_p">Another aspect of the problems with morphologically complex languages is that
the amount of training data in terms of running word-forms is greater, as the
amount of unique word-forms in an average text is much higher compared with
morphologically less complex languages. In addition, the majority of
morphologically complex languages tend to have fewer resources to train the
models. For training spelling checkers, the data needed is merely correctly
written unannotated text, but even that is scarce when it comes to languages
like Greenlandic or North Sámi. Even a very simple probabilistic weighting
using a small corpus of unverified texts will improve the quality of
suggestions <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib6" title="Creating and weighting Hunspell dictionaries as finite-state automata" class="ltx_ref">22</a>]</cite>, so having a weighted language model is
more effective.
</p>
</div>
</section>
</section>
<section id="S2" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">2 </span>Weighting Finite-State Language and Error Models</h2>
<div id="S2.p1" class="ltx_para">
<p class="ltx_p">The task of spell-checking is divided into locating spelling errors, and
suggesting the corrections for the spelling errors. In finite-state
spell-checking, the former task requires a language model that can tell
whether or not a given string is correct. The error correction requires two
components: a language model and an error model.</p>
</div>
<div id="S2.p2" class="ltx_para">
<p class="ltx_p">The error model is a two-tape finite-state automaton that can encode the
relation between misspellings and the correctly typed words. This relation can
also be weighted with the probabilities of making a specific typo or error, or
arbitrary hand-made penalties as with many of the traditional non-finite-state
approaches, e.g. <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib25" title="Hunspell manual" class="ltx_ref">17</a>]</cite>.</p>
</div>
<div id="S2.p3" class="ltx_para">
<p class="ltx_p">The rest of this section is organized as follows. In Subsection 2.1, we
describe how finite-state language models are made. In Subsection 2.2, we
describe how finite-state error models are made. In Subsection 2.3, we describe
some methods for combining the weights in language models with the weights in
error models.</p>
</div>
<section id="S2.SS1" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.1 </span>Compiling Finite-State Language Models</h3>
<div id="S2.SS1.p1" class="ltx_para">
<p class="ltx_p">The baseline for any language model as realized by numerous spell-checking
systems and the literature is a word-list (or a word-form list). One of the
most popular examples of this approach is given by Norvig <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib69" title="How to write a spelling corrector" class="ltx_ref">18</a>]</cite>,
describing a toy spelling corrector being made during an intercontinental
flight. The finite-state formulation of this idea is equally simple; given a
list of word-forms, we compile each string as a path in an
automaton <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib12" title="Effects of weighted finite-state language and error models on speed and efficiency of finite-state spell-checking" class="ltx_ref">21</a>]</cite>. In fact, even the classical optimized data
structures used for efficiently encoding word lists, like tries and acyclic
deterministic finite-state automata, are usable as finite-state automata for
our purposes, without modifications. We have:
<math id="S2.SS1.p1.m1" class="ltx_Math" alttext="\mathcal{M}_{\mathrm{S}}=\mathcal{M}_{\mathrm{D}}=\bigcup_{\mathrm{wf}\in%
\mathrm{corpus}}\mathrm{wf}" display="inline"><mrow><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi mathvariant="normal">S</mi></msub><mo>=</mo><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi mathvariant="normal">D</mi></msub><mo>=</mo><mrow><msub><mo largeop="true" mathsize="160%" stretchy="false" symmetric="true">⋃</mo><mrow><mi>wf</mi><mo>∈</mo><mi>corpus</mi></mrow></msub><mi>wf</mi></mrow></mrow></math>, where <math id="S2.SS1.p1.m2" class="ltx_Math" alttext="\mathrm{wf}" display="inline"><mi>wf</mi></math> is a word-form and <math id="S2.SS1.p1.m3" class="ltx_Math" alttext="\mathrm{corpus}" display="inline"><mi>corpus</mi></math> is a
set of word-forms in a corpus. These are already valid language models for
Formula 1 and 2, but in practice any finite-state
lexicon <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib66" title="Finite state morphology" class="ltx_ref">2</a>]</cite> will suffice.</p>
</div>
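<div class="ltx_para">
<p class="ltx_p">As a sketch of this word-list baseline, the following Python builds a trie-shaped acceptor from a weighted word-list: each string becomes a path, and its weight (for example a -log P estimate as above) becomes the end weight of the path. This is an illustrative toy rather than the finite-state lexicons used in the experiments.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
def compile_wordlist(word_weights):
    """Build a trie acceptor: each state is a dict keyed by the next character."""
    root = {}
    finals = {}                      # id(state) -> final weight
    for word, weight in word_weights.items():
        state = root
        for ch in word:
            state = state.setdefault(ch, {})
        finals[id(state)] = weight   # end weight of the path, e.g. -log P(word)
    return root, finals

def lookup(word, root, finals):
    """Return the end weight if the word is accepted, otherwise None."""
    state = root
    for ch in word:
        if ch not in state:
            return None
        state = state[ch]
    return finals.get(id(state))

root, finals = compile_wordlist({"cat": 1.2, "cast": 2.3, "chat": 3.1})
print(lookup("cast", root, finals))   # 2.3
print(lookup("cst", root, finals))    # None, i.e. not in the language model
</pre>
</div>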
</section>
<section id="S2.SS2" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.2 </span>Compiling Finite-State Versions of Error Models</h3>
<div id="S2.SS2.p1" class="ltx_para">
<p class="ltx_p">The baseline error model for spell-checking is the Damerau-Levenshtein distance
measure. As the finite-state formulations of error models are the most recent
development in finite-state spell-checking, the earliest reference to a
finite-state error model in an actual spell-checking system is by Schulz and
Mihov <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib51" title="Fast string correction with levenshtein-automata" class="ltx_ref">28</a>]</cite>. It also contains a very thorough description of
building finite-state models for different edit distances. As error models,
they can be applied in Formula 2. The edit distance type error models used in
this article are all simple edit distance models.</p>
</div>
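<div class="ltx_para">
<p class="ltx_p">As an illustration of the relation such an error model encodes, the following Python sketch enumerates the Damerau-Levenshtein distance-1 neighbourhood of a word over a given alphabet, much in the spirit of Norvig's corrector cited in Section 2.1; a finite-state error model encodes the same relation as a two-tape automaton instead of enumerating the strings. The alphabet is illustrative.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings at Damerau-Levenshtein distance 1 from the input word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if right[1:]]
    substitutes = [left + c + right[1:]
                   for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + transposes + substitutes + inserts)

print(len(edits1("cast")))        # a few hundred candidate strings
print("cats" in edits1("cast"))   # True: a transposition of the last two letters
</pre>
</div>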
<div id="S2.SS2.p2" class="ltx_para">
<p class="ltx_p">One of the most popular modifications to speed up the edit distance algorithm
is to disallow modifications of the first character of the
word <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib30" title="Spelling error pattern analysis of punjabi typed text" class="ltx_ref">4</a>]</cite>. This modification pro- vides a measurable
speed-up at a low cost to recall. The finite-state implementation of it is
simple; we concatenate one unmodifiable character in front of the error model.</p>
</div>
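<div class="ltx_para">
<p class="ltx_p">Reusing the edits1 sketch above, the restriction can be illustrated by holding the first character fixed and generating the edits over the rest of the word only; the finite-state version, as described, simply concatenates one unmodifiable character in front of the error model.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
def edits1_non_first(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Edit distance 1, with the first character of the word left untouched."""
    if not word:
        return set()
    return {word[0] + candidate for candidate in edits1(word[1:], alphabet)}

# "kast" is a distance-1 candidate for "cast" only if the first character may change.
print("kast" in edits1("cast"), "kast" in edits1_non_first("cast"))   # True False
</pre>
</div>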
<div id="S2.SS2.p3" class="ltx_para">
<p class="ltx_p">Hunspellâs implementation of the correction algorithm uses configurable
alphabets for the error types in the edit distance model. The errors that do
not come from regular typing mistakes are nearly always covered by specific
string transformations, i.e. confusion sets. Encoding a simple string
transformation as a finite-state automaton can be done as follows: for any
given transformation <math id="S2.SS2.p3.m1" class="ltx_Math" alttext="S:U" display="inline"><mrow><mi>S</mi><mo>:</mo><mi>U</mi></mrow></math> , we have a path <math id="S2.SS2.p3.m2" class="ltx_Math" alttext="\pi_{s:u}=S_{1}:U_{1}S_{2}:U_{2}\ldots S_{n}:U_{n}" display="inline"><mrow><mrow><msub><mi>π</mi><mrow><mi>s</mi><mo>:</mo><mi>u</mi></mrow></msub><mo>=</mo><msub><mi>S</mi><mn>1</mn></msub></mrow><mo>:</mo><mrow><msub><mi>U</mi><mn>1</mn></msub><mo></mo><msub><mi>S</mi><mn>2</mn></msub></mrow><mo>:</mo><mrow><msub><mi>U</mi><mn>2</mn></msub><mo></mo><mi mathvariant="normal">…</mi><mo></mo><msub><mi>S</mi><mi>n</mi></msub></mrow><mo>:</mo><msub><mi>U</mi><mi>n</mi></msub></mrow></math> , where <math id="S2.SS2.p3.m3" class="ltx_Math" alttext="n=\max(|S|,|U|)" display="inline"><mrow><mi>n</mi><mo>=</mo><mrow><mi>max</mi><mo></mo><mrow><mo stretchy="false">(</mo><mrow><mo stretchy="false">|</mo><mi>S</mi><mo stretchy="false">|</mo></mrow><mo>,</mo><mrow><mo stretchy="false">|</mo><mi>U</mi><mo stretchy="false">|</mo></mrow><mo stretchy="false">)</mo></mrow></mrow></mrow></math> and
the missing characters of the shorter word substituted with epsilons. The path
can be extended with arbitrary contexts <math id="S2.SS2.p3.m4" class="ltx_Math" alttext="L,R" display="inline"><mrow><mi>L</mi><mo>,</mo><mi>R</mi></mrow></math>, by concatenating those contexts
on the left and right, respectively. To apply these confusion sets on a word
using a language model, we use the following formula: <math id="S2.SS2.p3.m5" class="ltx_Math" alttext="\mathcal{M}_{\mathrm{confusion}}=\bigcup_{\mathrm{S:U}\in\mathrm{CP}}S:U" display="inline"><mrow><mrow><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi>confusion</mi></msub><mo>=</mo><mrow><msub><mo largeop="true" mathsize="160%" stretchy="false" symmetric="true">⋃</mo><mrow><mi mathvariant="normal">S</mi><mo>:</mo><mrow><mi mathvariant="normal">U</mi><mo>∈</mo><mi>CP</mi></mrow></mrow></msub><mi>S</mi></mrow></mrow><mo>:</mo><mi>U</mi></mrow></math>, where <math id="S2.SS2.p3.m6" class="ltx_Math" alttext="\mathrm{CP}" display="inline"><mi>CP</mi></math> is a set
of confused string pairs. The error model can be applied in a standard manner
in Formula 2. For a more detailed description of a finite-state
implementation of Hunspell error models, see Pirinen and
Linden <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib6" title="Creating and weighting Hunspell dictionaries as finite-state automata" class="ltx_ref">22</a>]</cite>.</p>
</div>
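<div class="ltx_para">
<p class="ltx_p">The following hedged Python sketch applies such confusion pairs as plain string transformations: each pair S:U is tried at every position of the word, and the resulting candidates would then be filtered and ranked against the suggestion model as in Formula 2. The pairs shown are illustrative, not the actual pairs of the Hunspell English dictionary.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
def apply_confusion_pairs(word, pairs):
    """Generate candidates by replacing one occurrence of S with U for each S:U pair."""
    candidates = set()
    for source, target in pairs:
        start = 0
        while True:
            i = word.find(source, start)
            if i == -1:
                break
            candidates.add(word[:i] + target + word[i + len(source):])
            start = i + 1
    return candidates

# Illustrative English-style confusion pairs.
pairs = [("ei", "ie"), ("ie", "ei"), ("able", "ible")]
print(apply_confusion_pairs("recieve", pairs))   # {'receive'}
</pre>
</div>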
</section>
<section id="S2.SS3" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.3 </span>Combining Weights from Different Sources and Different Models</h3>
<div id="S2.SS3.p1" class="ltx_para">
<p class="ltx_p">As both our language and error models are weighted automata, the weights need
to be combined when applying the error and the language models to a misspelled
string. Since the application performs what is basically a finite-state
composition as defined in Formula 2, the default outcome is a weight semiring
multiplication of the values; i.e., a real number addition in the tropical
semiring. This is a reasonable way to combine the models, which can be used
as a good baseline. In many cases, however, it is preferable to treat the
probabilities or the weights drawn from different sources as unequal in
strength. For example, in many of the existing spelling-checker systems, it is
preferable to first suggest all the corrections that assume only one spelling
error before the ones with two errors, regardless of the likelihood of the word
forms in the language model. To accomplish this, we scale the weights in the
error model to ensure that any weight in the error model is greater than or
equal to any weight in the language model: <math id="S2.SS3.p1.m1" class="ltx_Math" alttext="\hat{w_{e}}=w_{e}+maxw(\mathcal{M}_{\mathrm{S}})" display="inline"><mrow><mover accent="true"><msub><mi>w</mi><mi>e</mi></msub><mo stretchy="false">^</mo></mover><mo>=</mo><mrow><msub><mi>w</mi><mi>e</mi></msub><mo>+</mo><mrow><mi>m</mi><mo></mo><mi>a</mi><mo></mo><mi>x</mi><mo></mo><mi>w</mi><mo></mo><mrow><mo stretchy="false">(</mo><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi mathvariant="normal">S</mi></msub><mo stretchy="false">)</mo></mrow></mrow></mrow></mrow></math>, where <math id="S2.SS3.p1.m2" class="ltx_Math" alttext="\hat{w_{e}}" display="inline"><mover accent="true"><msub><mi>w</mi><mi>e</mi></msub><mo stretchy="false">^</mo></mover></math> is the scaled
weight of error model weights, <math id="S2.SS3.p1.m3" class="ltx_Math" alttext="w_{e}" display="inline"><msub><mi>w</mi><mi>e</mi></msub></math> the original error model weight and
<math id="S2.SS3.p1.m4" class="ltx_Math" alttext="maxw(\mathcal{M}_{\mathrm{S}})" display="inline"><mrow><mi>m</mi><mo></mo><mi>a</mi><mo></mo><mi>x</mi><mo></mo><mi>w</mi><mo></mo><mrow><mo stretchy="false">(</mo><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi mathvariant="normal">S</mi></msub><mo stretchy="false">)</mo></mrow></mrow></math> the
maximum weight found in the language model used for error corrections.</p>
</div>
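<div class="ltx_para">
<p class="ltx_p">A small sketch of this scaling, assuming the largest weight of the suggestion model can simply be read off: every error-model weight is shifted above it, so suggestions are ordered first by the number of errors and only then by the likelihood of the word-form. All weights below are illustrative.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
def scale_error_weights(error_weights, language_model_weights):
    """Shift every error weight above the largest language-model weight."""
    max_lm_weight = max(language_model_weights.values())
    return {error: weight + max_lm_weight
            for error, weight in error_weights.items()}

lm_weights = {"cat": 1.2, "cast": 2.3, "chat": 3.1}          # -log P word weights
error_weights = {"substitution": 1.0, "transposition": 1.5}  # illustrative error costs
print(scale_error_weights(error_weights, lm_weights))
# {'substitution': 4.1, 'transposition': 4.6}
</pre>
</div>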
</section>
</section>
<section id="S3" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">3 </span>The Language and Error Model Data Used For Evaluation</h2>
<div id="S3.p1" class="ltx_para">
<p class="ltx_p">To evaluate the weighting schemes and the language and the error models, we
have selected two of the morphologically more complex languages with little to
virtually no corpus resources available: North Sámi and Greenlandic.
Furthermore, as a morphologically complex language with moderate resources,
we have used Finnish. As a comparative baseline for a morphologically simple
language with huge corpus resources, we use English. English is also used
here to reproduce the results of the existing models to verify functionality
of our selected approach.</p>
</div>
<div id="S3.p2" class="ltx_para">
<p class="ltx_p">This section briefly introduces the data and methods to compile the models; for
the exact implementation, for any reproduction of results or for attempts to
implement the same approaches for another language, the reader is advised to
utilize the scripts, the programs and the makefiles available at our source
code repository.
<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">4</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">4</sup><span class="ltx_ERROR undefined">\url</span>https://github.com/flammie/purplemonkeydishwasher/tree/master/2014-cicling</span></span></span></p>
</div>
<figure id="S3.T1" class="ltx_table">
<figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_table">Table 1: </span>The extent of Wikipedia data per language</figcaption>
<table class="ltx_tabular ltx_guessed_headers ltx_align_middle">
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">Data:</th>
<td class="ltx_td ltx_align_right">Train tokens</td>
<td class="ltx_td ltx_align_right">Train types</td>
<td class="ltx_td ltx_align_right">Test tokens</td>
<td class="ltx_td ltx_align_right">Test types</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">Language</th>
<td class="ltx_td"></td>
<td class="ltx_td"></td>
<td class="ltx_td"></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">English</th>
<td class="ltx_td ltx_align_right">276,730,786</td>
<td class="ltx_td ltx_align_right">3,216,142</td>
<td class="ltx_td ltx_align_right">111,882,292</td>
<td class="ltx_td ltx_align_right">1,945,878</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">Finnish</th>
<td class="ltx_td ltx_align_right">9,779,826</td>
<td class="ltx_td ltx_align_right">1,065,631</td>
<td class="ltx_td ltx_align_right">4,116,896</td>
<td class="ltx_td ltx_align_right">538,407</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">North Sámi</th>
<td class="ltx_td ltx_align_right">183,643</td>
<td class="ltx_td ltx_align_right">38,893</td>
<td class="ltx_td ltx_align_right">33,722</td>
<td class="ltx_td ltx_align_right">8,239</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">Greenlandic</th>
<td class="ltx_td ltx_align_right">136,241</td>
<td class="ltx_td ltx_align_right">28,268</td>
<td class="ltx_td ltx_align_right">7,233</td>
<td class="ltx_td ltx_align_right">1,973</td>
</tr>
</tbody>
</table>
</figure>
<div id="S3.p3" class="ltx_para">
<p class="ltx_p">In Table 1, we show the statistics of the data we have drawn from Wikipedia for
training and testing purposes. In the case of English and Finnish, the data is
selected from a subset of Wikipedia test tokens. With North Sámi and
Greenlandic, we had no other choice but to use all Wikipedia test tokens.</p>
</div>
<div id="S3.p4" class="ltx_para">
<p class="ltx_p">For the English language model, we use the data from Norvig <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib69" title="How to write a spelling corrector" class="ltx_ref">18</a>]</cite> and
Pirinen and Hardwick <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib12" title="Effects of weighted finite-state language and error models on speed and efficiency of finite-state spell-checking" class="ltx_ref">21</a>]</cite>, which is a basic language model
based on a frequency weighted wordlist extracted from freely available Internet
corpora such as Wikipedia and project Gutenberg. The language models for North
Sámi, Finnish and Greenlandic are drawn from the free/libre open-source
repository of finite-state language models managed by the University of
Tromsø.<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">5</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">5</sup><span class="ltx_ERROR undefined">\url</span>http://giellatekno.uit.no</span></span></span> The language models are all
based on the morphological analyzers built in the finite-state
morphology <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib66" title="Finite state morphology" class="ltx_ref">2</a>]</cite> fashion. The repository also includes the
basic versions of finite-state spell-checking under the same framework that we
use in this article for testing. To compile our dictionaries, we have used the
makefiles available in the repository. The exact methods for this are also
detailed in the source code of the repository.</p>
</div>
<div id="S3.p5" class="ltx_para">
<p class="ltx_p">The error models for English are combined from a basic edit distance with
English alphabet a-z and the confusion set from Hunspell's English dictionary
containing 96 confusion pairs<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">6</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">6</sup>as found in en-US.aff from Ubuntu Linux
LTS 12.04</span></span></span>. The error models for North Sámi, Finnish and Greenlandic are the
edit distances of English with the addition of åäöšžč and ŧ for North Sámi
and åäöšž for Finnish. For North Sámi we also use the actual
Hunspell parts from the divvun speller<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">7</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">7</sup>http://divvun.no</span></span></span>; for
Greenlandic, we have no confusion sets or character likelihoods for
Hunspell-style data, so only the ordering of the Hunspell correction mechanisms
is retained. For English, the Hunspell phonemic folding scheme was not used.
This makes the English results easier to compare with those of other languages,
which do not even have any phonemic error sources.</p>
</div>
<div id="S3.p6" class="ltx_para">
<p class="ltx_p">The keyboard adjacency weighting and optimization for the English error models
is based on a basic qwerty keyboard. The keyboard adjacency values are taken
from the CLDR Version 22<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">8</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">8</sup>http://cldr.unicode.org</span></span></span>, modified to the
standard 101-104 key PC keyboard layout.
</div>
<div id="S3.p7" class="ltx_para">
<p class="ltx_p">The training corpora for each of the languages are based on Wikipedia. To
estimate the weights in the models, we have used the correct word-forms of the
first 90 % of Wikipedia for the language model and the non-words for the error
model. We used the remaining 10 % for extracting non-words for testing. The
error corpus was extracted with a script very similar to the one described by
Max and Wisniewski <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib57" title="Mining naturally-occurring corrections and paraphrases from wikipedia's revision history" class="ltx_ref">14</a>]</cite>. The script that performs fetching and
cleaning can be found in our repository.
<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">9</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">9</sup><span class="ltx_ERROR undefined">\url</span>https://github.com/flammie/purplemonkeydishwasher/tree/master/2014-cicling</span></span></span>
We have selected the
spelling corrections found in Wikipedia by only taking those, where the
incorrect version does not belong to the language model (i.e. is a non-word
error), and the corrected word-form does.</p>
</div>
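<div class="ltx_para">
<p class="ltx_p">A hedged sketch of this final selection step: from the (incorrect, corrected) pairs mined from Wikipedia revisions, only those are kept where the incorrect form is rejected and the corrected form is accepted by the language model. The pair list and lexicon below are placeholders; the actual mining script is the one in the repository referenced above.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
def select_nonword_errors(correction_pairs, lexicon):
    """Keep pairs whose incorrect form is a non-word and whose correction is a word."""
    selected = []
    for incorrect, corrected in correction_pairs:
        if incorrect not in lexicon and corrected in lexicon:
            selected.append((incorrect, corrected))
    return selected

pairs = [("teh", "the"), ("form", "from"), ("qwe", "qwer")]
lexicon = {"the", "from", "form"}
print(select_nonword_errors(pairs, lexicon))   # [('teh', 'the')]
</pre>
</div>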
<section id="S3.SS1" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">3.1 </span>The Models Used For Evaluation</h3>
<div id="S3.SS1.p1" class="ltx_para">
<p class="ltx_p">The finite-state language and error models described in this article have a
number of adjustable settings. For weighting our language models, we have
picked a subset of corpus strings for estimating word form probabilities. As
both North Sámi and Greenlandic Wikipedia were quite limited in size, we used
all strings except those that appear only once (hapax legomena) whereas for
Finnish, we set the frequency threshold to 5, and for English, we set it to 20.
For English, we also used all word-forms in the material from Norvig's corpora,
as we believe that they are already hand-selected to some extent.</p>
</div>
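<div class="ltx_para">
<p class="ltx_p">The frequency cut-offs can be sketched as a simple filter over corpus counts; the thresholds come from the text (2, i.e. dropping hapax legomena, for North Sámi and Greenlandic; 5 for Finnish; 20 for English), while the token list below is only illustrative.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
from collections import Counter

def training_wordlist(tokens, threshold):
    """Keep word-forms that occur at least `threshold` times in the corpus."""
    counts = Counter(tokens)
    return {word: count for word, count in counts.items() if count >= threshold}

tokens = ["talo", "talo", "talossa", "taloissa", "talo", "talossa"]
print(training_wordlist(tokens, threshold=2))   # {'talo': 3, 'talossa': 2}
</pre>
</div>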
<div id="S3.SS1.p2" class="ltx_para">
<p class="ltx_p">As error models, we have selected the following combinations of basic models:
the basic edit distance consisting of homogeneously weighted errors of the
Levenshtein-Damerau type, the same model limited to the non-first positions of
the word, and the Hunspell version of the edit distance errors (i.e. swaps only
apply to adjacent keys, and deletions and additions are only tried for a
selected alphabet).</p>
</div>
</section>
</section>
<section id="S4" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">4 </span>The Speed and Quality of Different Finite-State Models and
Weighting Schemes</h2>
<div id="S4.p1" class="ltx_para">
<p class="ltx_p">To evaluate the systems, we have used a modified version of the HFST
spell-checking tool <span class="ltx_text ltx_font_typewriter">hfst-ospell-survey</span> 0.2.4, otherwise using the
default options, but for the speed measurements we have used the
<span class="ltx_text ltx_font_typewriter">--profile</span> argument. The evaluation of speed and memory usage has been
performed by averaging over five test runs on a dedicated test server: an Intel
Xeon E5450 at 3 GHz, with 64 GB of RAM memory. The rest of the section is
organized as follows: in Subsection 4.1, we show naïve coverage baselines. In
Subsection 4.2, we measure the quality of spell-checking with real-world
spelling corrections found in Wikipedia logs. Finally in Subsections 4.3 and
4.4, we provide the speed and memory efficiency figures for these experiments,
respectively.</p>
</div>
<section id="S4.SS1" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">4.1 </span>Coverage Evaluation</h3>
<div id="S4.SS1.p1" class="ltx_para">
<p class="ltx_p">To show the starting point for spell-checking, we measure the coverage of the
language models. That is, we measure how much of the test data can be
recognized using only the language models, and how many of the word-forms are
beyond the reach of the models. The measurements in Table 2 are measured over
word-forms in running text that can be measured in reasonable time, i.e. no
more than the first 1,000,000 word-forms of each test corpus. As can be seen in
Table 2, the task is very different for languages like English compared with
morphologically more complex languages.</p>
</div>
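<div class="ltx_para">
<p class="ltx_p">The coverage figures of Table 2 correspond to the share of running word-forms accepted by the language model; the following is a hedged sketch of that measurement, with a plain set standing in for the automaton lookup and an illustrative token list.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
def coverage(tokens, lexicon, limit=1000000):
    """Share (in %) of the first `limit` running tokens recognized by the language model."""
    tokens = tokens[:limit]
    recognized = sum(1 for token in tokens if token in lexicon)
    return 100.0 * recognized / len(tokens)

tokens = ["the", "cat", "sat", "on", "teh", "mat"]
lexicon = {"the", "cat", "sat", "on", "mat"}
print(round(coverage(tokens, lexicon), 1))   # 83.3
</pre>
</div>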
<figure id="S4.T2" class="ltx_table">
<figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_table">Table 2: </span>The word-form coverage of the language models on test data (in %).</figcaption>
<table class="ltx_tabular ltx_align_middle">
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">English aspell</td>
<td class="ltx_td ltx_align_right">22.7</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">English full automaton</td>
<td class="ltx_td ltx_align_right">80.1</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Finnish full automaton</td>
<td class="ltx_td ltx_align_right">64.8</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">North Sámi Hunspell</td>
<td class="ltx_td ltx_align_right">34.4</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">North Sámi full automaton</td>
<td class="ltx_td ltx_align_right">48.5</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Greenlandic full automaton</td>
<td class="ltx_td ltx_align_right">25.3</td>
</tr>
</tbody>
</table>
</figure>
</section>
<section id="S4.SS2" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">4.2 </span>Quality Evaluation</h3>
<div id="S4.SS2.p1" class="ltx_para">
<p class="ltx_p">To measure the quality of spell-checking, we have run the list of misspelled
words through the language and error models of our spelling correctors,
extracting all the suggestions. The quality, in Table 3, is measured by the
proportion of correct sugges- tions appearing at a given position 1-5 and
finally the proportion appearing in any remaining positions.</p>
</div>
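<div class="ltx_para">
<p class="ltx_p">The sketch below shows one way such a precision-at-rank table can be computed
from pairs of a suggestion list and the intended correction; the function and
variable names are illustrative and not part of the evaluation tool.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
# A sketch of the quality metric in Table 3: for each misspelling, find the
# position of the known correction in the suggestion list, then report the
# share of corrections found at each rank 1-5 and at any later rank ("rest").
from collections import Counter

def precision_at_ranks(cases, max_rank=5):
    """`cases` is an iterable of (suggestions, correct_form) pairs."""
    counts = Counter()
    total = 0
    for suggestions, correct in cases:
        total = total + 1
        if correct in suggestions:
            rank = suggestions.index(correct) + 1
            counts[min(rank, max_rank + 1)] += 1  # ranks past max_rank pool into "rest"
    if total == 0:
        return {}
    # Keys 1..max_rank are the ranked columns; key max_rank + 1 is the "rest" column.
    return {rank: 100.0 * counts[rank] / total for rank in range(1, max_rank + 2)}
</pre>
</div>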
<div id="S4.SS2.p2" class="ltx_para">
<p class="ltx_p">On the rows indicated with error, in Table 3, we present the baselines for
using language and error models allowing one edit in any position of the word.
The rows with ânon-first errorâ show the same error models with the restriction
that the first letter of the word may not be changed.</p>
</div>
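<div class="ltx_para">
<p class="ltx_p">In the actual system this restriction is implemented inside the weighted
error-model transducer; the sketch below only illustrates, on plain strings,
which edit-distance-one candidates remain when the first letter must stay
intact. The function name and the explicit candidate generation are
illustrative assumptions, not the finite-state implementation.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
# Generate single-edit variants (deletion, substitution, insertion, adjacent
# swap) at every position except the first character of the word.
def non_first_edits(word, alphabet):
    """Return edit-distance-1 variants of `word` that leave the first letter untouched."""
    candidates = set()
    for i in range(1, len(word)):
        candidates.add(word[:i] + word[i + 1:])          # deletion at i
        for c in alphabet:
            candidates.add(word[:i] + c + word[i + 1:])  # substitution at i
            candidates.add(word[:i] + c + word[i:])      # insertion before i
    for c in alphabet:
        candidates.add(word + c)                         # insertion at the end
    for i in range(1, len(word) - 1):
        swapped = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        candidates.add(swapped)                          # swap of positions i and i+1
    candidates.discard(word)
    return candidates
</pre>
</div>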
<div id="S4.SS2.p3" class="ltx_para">
<p class="ltx_p">Finally, in Table 3, we also compare the results of our spell-checkers with the
actual systems in everyday use, i.e. the Hunspell and aspell in practice.
When looking at this comparison, we can see that for English data, we actually
provide an overall im- provement already by allowing only one edit per word.
This is mainly due to the weighted language model which works very nicely for
languages like English. The data on North Sámi on the other hand shows no
meaningful improvement neither with the change from Hunspell to our weighted
language models nor with the restriction of the error models.</p>
</div>
<figure id="S4.T3" class="ltx_table">
<figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_table">Table 3: </span>The effect of different language and error models on correction quality (precision in % at a given suggestion position)</figcaption>
<table class="ltx_tabular ltx_align_middle">
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Rank:</td>
<td class="ltx_td ltx_align_right">1st</td>
<td class="ltx_td ltx_align_right">2nd</td>
<td class="ltx_td ltx_align_right">3rd</td>
<td class="ltx_td ltx_align_right">4th</td>
<td class="ltx_td ltx_align_right">5th</td>
<td class="ltx_td ltx_align_right">rest</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Language and error models</td>
<td class="ltx_td"></td>
<td class="ltx_td"></td>
<td class="ltx_td"></td>
<td class="ltx_td"></td>
<td class="ltx_td"></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">English aspell</td>
<td class="ltx_td ltx_align_right">55.7</td>
<td class="ltx_td ltx_align_right">5.7</td>
<td class="ltx_td ltx_align_right">8.0</td>
<td class="ltx_td ltx_align_right">2.2</td>
<td class="ltx_td ltx_align_right">0.0</td>
<td class="ltx_td ltx_align_right">0.0</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">English Hunspell</td>
<td class="ltx_td ltx_align_right">59.3</td>
<td class="ltx_td ltx_align_right">5.8</td>
<td class="ltx_td ltx_align_right">3.5</td>
<td class="ltx_td ltx_align_right">2.3</td>
<td class="ltx_td ltx_align_right">0.0</td>
<td class="ltx_td ltx_align_right">0.0</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">English w/ 1 error</td>
<td class="ltx_td ltx_align_right">66.7</td>
<td class="ltx_td ltx_align_right">7.0</td>
<td class="ltx_td ltx_align_right">5.2</td>
<td class="ltx_td ltx_align_right">1.8</td>
<td class="ltx_td ltx_align_right">1.8</td>
<td class="ltx_td ltx_align_right">1.8</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">English w/ 1 non-first error</td>
<td class="ltx_td ltx_align_right">66.7</td>
<td class="ltx_td ltx_align_right">8.8</td>
<td class="ltx_td ltx_align_right">7.0</td>
<td class="ltx_td ltx_align_right">0.0</td>
<td class="ltx_td ltx_align_right">0.0</td>
<td class="ltx_td ltx_align_right">1.8</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Finnish aspell</td>
<td class="ltx_td ltx_align_right">21.1</td>
<td class="ltx_td ltx_align_right">5.8</td>
<td class="ltx_td ltx_align_right">3.8</td>
<td class="ltx_td ltx_align_right">1.9</td>
<td class="ltx_td ltx_align_right">0.0</td>
<td class="ltx_td ltx_align_right">0.0</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Finnish w/ 1 error</td>
<td class="ltx_td ltx_align_right">54.8</td>
<td class="ltx_td ltx_align_right">19.0</td>
<td class="ltx_td ltx_align_right">7.1</td>
<td class="ltx_td ltx_align_right">0.0</td>
<td class="ltx_td ltx_align_right">0.0</td>
<td class="ltx_td ltx_align_right">0.0</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Finnish w/ 1 non-first error</td>
<td class="ltx_td ltx_align_right">54.8</td>
<td class="ltx_td ltx_align_right">21.4</td>
<td class="ltx_td ltx_align_right">4.8</td>
<td class="ltx_td ltx_align_right">0.0</td>
<td class="ltx_td ltx_align_right">0.0</td>
<td class="ltx_td ltx_align_right">0.0</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">North Sámi Hunspell</td>
<td class="ltx_td ltx_align_right">9.4</td>
<td class="ltx_td ltx_align_right">3.1</td>
<td class="ltx_td ltx_align_right">0.0</td>
<td class="ltx_td ltx_align_right">3.1</td>
<td class="ltx_td ltx_align_right">0.0</td>
<td class="ltx_td ltx_align_right">0.0</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">North Sámi w/ 1 error</td>
<td class="ltx_td ltx_align_right">3.5</td>
<td class="ltx_td ltx_align_right">3.5</td>
<td class="ltx_td ltx_align_right">0.0</td>
<td class="ltx_td ltx_align_right">6.9</td>
<td class="ltx_td ltx_align_right">0.0</td>
<td class="ltx_td ltx_align_right">0.0</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">North Sámi w/ 1 non-first error</td>
<td class="ltx_td ltx_align_right">3.5</td>
<td class="ltx_td ltx_align_right">3.5</td>
<td class="ltx_td ltx_align_right">0.0</td>
<td class="ltx_td ltx_align_right">6.9</td>
<td class="ltx_td ltx_align_right">0.0</td>
<td class="ltx_td ltx_align_right">0.0</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Greenlandic w/ 1 error</td>
<td class="ltx_td ltx_align_right">13.3</td>
<td class="ltx_td ltx_align_right">2.2</td>
<td class="ltx_td ltx_align_right">6.7</td>
<td class="ltx_td ltx_align_right">2.2</td>
<td class="ltx_td ltx_align_right">0.0</td>
<td class="ltx_td ltx_align_right">8.9</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Greenlandic w/ 1 non-first error</td>
<td class="ltx_td ltx_align_right">13.3</td>
<td class="ltx_td ltx_align_right">2.2</td>
<td class="ltx_td ltx_align_right">6.7</td>
<td class="ltx_td ltx_align_right">2.2</td>
<td class="ltx_td ltx_align_right">0.0</td>
<td class="ltx_td ltx_align_right">8.9</td>
</tr>
</tbody>
</table>
</figure>
<div id="S4.SS2.p4" class="ltx_para">
<p class="ltx_p">Some of the trade-offs are efficiency versus quality. In Table 3, we measure
among other things the quality effect of limiting the search space in the error
model. It is important to contrast these results with the speed or memory gains
shown in the corresponding Tables 4 and 5. As we can see, the optimizations
that limit the search space will generally not have a big effect on the
results. Only the results that get cut out of the search space are moved. A few
of the results disappear or move to worse positions.</p>
</div>
</section>
<section id="S4.SS3" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">4.3 </span>Speed Evaluation</h3>
<div id="S4.SS3.p1" class="ltx_para">
<p class="ltx_p">For practical spell-checking systems, there are multiple levels of speed
requirements, so we measure the effects of our different models on speed to see
if the optimal mod- els can actually be used in interactive systems, off-line
corrections, or just batch pro- cessing. In Table 4, we show the speed of
different model combinations for spell- checkingâfor a more thorough evaluation
of the speed of the finite-state language and the error models we refer to
Pirinen et al. <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib12" title="Effects of weighted finite-state language and error models on speed and efficiency of finite-state spell-checking" class="ltx_ref">21</a>]</cite>. We perform three different test sets:
startup time tests to see how much time is spent on startup alone; a running
cor- pus processing test to see how well the system fares when processing
running text;and a non-word correcting test, to see how fast the system is when
producing correc- tions for words. For each test, the results are averaged over
at least 5 runs.</p>
</div>
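<div class="ltx_para">
<p class="ltx_p">The correction-rate figures in Table 4 are plain words-per-second counts; a
minimal sketch of such a throughput measurement is given below, with the
hypothetical <span class="ltx_text ltx_font_typewriter">correct</span> callback standing in for one check-and-suggest call of the
spell-checker.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
# A sketch of the throughput figures in Table 4: words per second over a list
# of input word-forms (all words of running text, or only the non-words).
import time

def words_per_second(words, correct):
    """`words` is a list of word-forms; `correct` runs one check-and-suggest call."""
    start = time.perf_counter()
    for w in words:
        correct(w)
    elapsed = time.perf_counter() - start
    return len(words) / elapsed if elapsed else float("inf")
</pre>
</div>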
<figure id="S4.T4" class="ltx_table">
<figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_table">Table 4: </span>The effect of different language and error models on speed of spelling correction (startup time in seconds, correction rate in words per second)</figcaption>
<table class="ltx_tabular ltx_align_middle">
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Input: <math id="S4.T4.m1" class="ltx_Math" alttext="1^{\mathrm{st}}" display="inline"><msup><mn>1</mn><mi>st</mi></msup></math> word</td>
<td class="ltx_td ltx_align_right">all words</td>
<td class="ltx_td ltx_align_right">non-words</td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Language and error models</td>
<td class="ltx_td"></td>
<td class="ltx_td"></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">English Hunspell</td>
<td class="ltx_td ltx_align_right">0.5</td>
<td class="ltx_td ltx_align_right">174</td>
<td class="ltx_td ltx_align_right">40</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">English w/ 1 error</td>
<td class="ltx_td ltx_align_right">0.06</td>
<td class="ltx_td ltx_align_right">5,721</td>
<td class="ltx_td ltx_align_right">6,559</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">English w/ 1 non-first error</td>
<td class="ltx_td ltx_align_right">0.20</td>
<td class="ltx_td ltx_align_right">16,474</td>
<td class="ltx_td ltx_align_right">17,911</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Finnish aspell</td>
<td class="ltx_td ltx_align_right">¡0.1</td>
<td class="ltx_td ltx_align_right">781</td>
<td class="ltx_td ltx_align_right">686</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Finnish w/ 1 error</td>
<td class="ltx_td ltx_align_right">1.0</td>
<td class="ltx_td ltx_align_right">166</td>
<td class="ltx_td ltx_align_right">357</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Finnish w/ 1 non-first error</td>
<td class="ltx_td ltx_align_right">1.0</td>
<td class="ltx_td ltx_align_right">303</td>
<td class="ltx_td ltx_align_right">1,886</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">North Sámi Hunspell</td>
<td class="ltx_td ltx_align_right">4.51</td>
<td class="ltx_td ltx_align_right">3</td>
<td class="ltx_td ltx_align_right">2</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">North Sámi w/ 1 error</td>
<td class="ltx_td ltx_align_right">0.28</td>
<td class="ltx_td ltx_align_right">2,304</td>
<td class="ltx_td ltx_align_right">2,839</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">North Sámi w/ 1 non-first error</td>
<td class="ltx_td ltx_align_right">0.27</td>
<td class="ltx_td ltx_align_right">5,025</td>
<td class="ltx_td ltx_align_right">7,898</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Greenlandic w/ 1 error</td>
<td class="ltx_td ltx_align_right">1.27</td>
<td class="ltx_td ltx_align_right">49</td>
<td class="ltx_td ltx_align_right">142</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Greenlandic w/ 1 non-first error</td>
<td class="ltx_td ltx_align_right">1.25</td>
<td class="ltx_td ltx_align_right">85</td>
<td class="ltx_td ltx_align_right">416</td>
</tr>
</tbody>
</table>
</figure>
<div id="S4.SS3.p2" class="ltx_para">
<p class="ltx_p">In Table 4, we already notice an important aspect of finite-state spelling
correction: the speed is very predictable, and in the same ballpark regardless
of input data. Furthermore, we can readily see that the speed of a
finite-state system in general outperforms Hunspell with both of the language
models we compare. Furthermore, we show the speed gains achieved by cutting the
search space to disallow errors in the first character of a word. This is the
speed-equivalent of Table 3 of the previous section, which clearly shows the
trade-off between speed and quality.</p>
</div>
</section>
<section id="S4.SS4" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">4.4 </span>Memory Usage Evaluation</h3>
<div id="S4.SS4.p1" class="ltx_para">
<p class="ltx_p">Depending on the use case of the spell-checker, memory usage may also be a
limiting factor. To give an idea of the memory-speed trade-offs that different
finite-state models entail, in Table 5, we provide the memory usage values
when performing the evaluation tasks above. The measurements are performed
with the Valgrind utility and represent the peak memory usage. It needs to be
emphasized that this method, like all of the methods of measuring memory usage
of a program, has its flaws, and the figures can at best be considered rough
estimates.</p>
</div>
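<div class="ltx_para">
<p class="ltx_p">As a rough illustration, the sketch below shows how a peak-memory figure of
this kind could be obtained with Valgrind's massif tool; the choice of massif
and the wrapper code are assumptions made here for illustration, since the text
above only refers to the Valgrind utility in general.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
# Run a command under valgrind massif and report the largest snapshot
# (heap + heap overhead + stacks) recorded in the massif output file.
import re
import subprocess

def peak_memory_bytes(cmd, out_file="massif.out"):
    subprocess.run(["valgrind", "--tool=massif", "--stacks=yes",
                    "--massif-out-file=" + out_file] + cmd, check=True)
    peaks = []
    snapshot = {}
    with open(out_file) as fh:
        for line in fh:
            m = re.match(r"(mem_heap_B|mem_heap_extra_B|mem_stacks_B)=(\d+)", line)
            if m:
                snapshot[m.group(1)] = int(m.group(2))
                if len(snapshot) == 3:
                    peaks.append(sum(snapshot.values()))
                    snapshot = {}
    return max(peaks) if peaks else 0
</pre>
</div>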
<figure id="S4.T5" class="ltx_table">
<figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_table">Table 5: </span>The peak memory usage of processes checking and correcting word-forms with various language and error model combinations.</figcaption>
<table class="ltx_tabular ltx_guessed_headers ltx_align_middle">
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">Measurement:</th>
<td class="ltx_td ltx_align_right">Peak memory usage</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">Language and error models</th>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">English Hunspell</th>
<td class="ltx_td ltx_align_right">7.5 MB</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">English w/ 1 error</th>
<td class="ltx_td ltx_align_right">7.0 MB</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">English w/ 1 non-first error</th>
<td class="ltx_td ltx_align_right">7.0 MB</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">Finnish aspell</th>
<td class="ltx_td ltx_align_right">186 kB</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">Finnish w/ 1 error</th>
<td class="ltx_td ltx_align_right">79.3 MB</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">Finnish w/ 1 non-first error</th>
<td class="ltx_td ltx_align_right">79.3 MB</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">North Sámi Hunspell</th>
<td class="ltx_td ltx_align_right">151.0 MB</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">North Sámi w/ 1 error</th>
<td class="ltx_td ltx_align_right">31.4 MB</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">North Sámi w/ 1 non-first error</th>
<td class="ltx_td ltx_align_right">31.4 MB</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">Greenlandic w/ 1 error</th>
<td class="ltx_td ltx_align_right">300.0 MB</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">Greenlandic w/ 1 non-first error</th>
<td class="ltx_td ltx_align_right">300.7 MB</td>
</tr>
</tbody>
</table>
</figure><span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">10</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">10</sup>Drobac & al. <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib95" title="Heuristic hyperminimization of finite-state lexicons" class="ltx_ref">11</a>]</cite> report on how to hyper-minimize finite-state lexicons keeping the Greenlandic lexicon at less than 20 MB at runtime which gives a considerable speed-up of loading
with only a small reduction of runtime speed.</span></span></span>
</section>
</section>
<section id="S5" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">5 </span>Discussion</h2>
<div id="S5.p1" class="ltx_para">
<p class="ltx_p">The improvement of quality by using simple probabilistic features for
spell-checking is well-studied, e.g. by Church and
Gale <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib90" title="Probability scoring for spelling correction" class="ltx_ref">7</a>]</cite>. In our work, we describe introducing
probabilistic features into a finite-state spell-checking system, giving an
increase in the quality of the spell-checking suggestions similar to that seen in
previous approaches. The methods are usable for a morphologically varied set