-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathPirinen-2011-cicling-contextspell.html
1085 lines (1064 loc) · 83.8 KB
/
Pirinen-2011-cicling-contextspell.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
<!DOCTYPE html><html>
<head>
<title>Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/.</title>
<!--Generated on Fri Sep 29 12:53:30 2017 by LaTeXML (version 0.8.2) http://dlmf.nist.gov/LaTeXML/.-->
<!--Document created on Last Modification: September 29, 2017.-->
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link rel="stylesheet" href="../latexml/LaTeXML.css" type="text/css">
<link rel="stylesheet" href="../latexml/ltx-article.css" type="text/css">
</head>
<body>
<div class="ltx_page_main">
<div class="ltx_page_content">
<article class="ltx_document ltx_authors_1line">
<h1 class="ltx_title ltx_title_document">Improving Finite-State Spell-Checker Suggestions with
Part of Speech N-Grams<span class="ltx_ERROR undefined">\footnotepubrights</span>Originally published in CICLING 2012
in Delhi <span class="ltx_ERROR undefined">\url</span>http://cicling.org/2012/.</h1>
<div class="ltx_authors">
<span class="ltx_creator ltx_role_author">
<span class="ltx_personname">Tommi A Pirinen
</span></span>
<span class="ltx_author_before"> </span><span class="ltx_creator ltx_role_author">
<span class="ltx_personname">Miikka Silfverberg
</span></span>
<span class="ltx_author_before"> </span><span class="ltx_creator ltx_role_author">
<span class="ltx_personname">Krister Lindén
<br class="ltx_break">University of Helsinki
<br class="ltx_break">Department of Modern Languages
<br class="ltx_break">University of Helsinki
</span></span>
<span class="ltx_author_before"> </span><span class="ltx_creator ltx_role_author">
<span class="ltx_personname"> 00014
<br class="ltx_break">tommi.pirinen
</span></span>
<span class="ltx_author_before"> </span><span class="ltx_creator ltx_role_author">
<span class="ltx_personname">miikka.silfverberg
</span></span>
<span class="ltx_author_before"> </span><span class="ltx_creator ltx_role_author">
<span class="ltx_personname">[email protected]
</span></span>
</div>
<div class="ltx_date ltx_role_creation">Last Modification: September 29, 2017</div>
<div class="ltx_abstract">
<h6 class="ltx_title ltx_title_abstract">Abstract</h6>
<p class="ltx_p">In this paper we demonstrate a finite-state implementation of
context-aware spell checking utilizing an N-gram based part of speech (POS)
tagger to rerank the suggestions from a simple edit-distance based
spell-checker. We demonstrate the benefits of context-aware spell-checking for
English and Finnish and introduce modifications that are necessary to make
traditional N-gram models work for morphologically more complex languages, such
as Finnish.</p>
</div>
<section id="S1" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">1 </span>Introduction</h2>
<div id="S1.p1" class="ltx_para">
<p class="ltx_p">Spell-checking by computer is perhaps one of the oldest and most researched
applications in the field of language technology starting from the mid 20th
century <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib25" title="A technique for computer detection and correction of spelling errors" class="ltx_ref">3</a>]</cite>. One of the crucial parts of spell-checking—both
from an interactive user-interface point of view and for unsupervised correction
of errors—is the production of spelling suggestions. In this article we test
various finite-state methods for using context and shallow morphological
analysis to improve the suggestions generated by traditional edit distance
measures or unigram frequencies such as simple weighted
finite-state dictionaries trained from word form frequencies as
in <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib12" title="Finite-state spell-checking with weighted language and error models" class="ltx_ref">13</a>]</cite>.</p>
</div>
<div id="S1.p2" class="ltx_para">
<p class="ltx_p">The spell-checking task can be split into two parts, i.e. <em class="ltx_emph">detection</em> and
actual <em class="ltx_emph">correction</em> of the spelling errors. The spelling errors can be
detected in text as word forms that are unlikely to belong to the
natural language in question, such as writing ‘cta’ instead of ‘cat’. This form
of spelling errors is commonly called <em class="ltx_emph">non-word (spelling) errors</em>.
Another form of spelling errors is word forms that do not belong to the given
context under certain syntactic or semantic requirements, such as writing
‘their’ instead of ‘there’. This form is correspondingly called <em class="ltx_emph">real-word
(spelling) errors</em>. The non-word type of spelling errors can easily be detected
using a dictionary, whereas the detection of the latter type of errors typically
requires syntactic analysis or probabilistic methods <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib8" title="Ordering the suggestions of a spellchecker without using context*" class="ltx_ref">8</a>]</cite>. For
the purpose of this article we do not distinguish between them, as the same
correction methods can be applied to both.</p>
</div>
<div id="S1.p3" class="ltx_para">
<p class="ltx_p">The correction of spelling errors usually means generating a list of word forms
belonging to the language for a user to chose from. The mechanism for
generating correction suggestions for the erroneous word-forms is an
<em class="ltx_emph">error-model</em>. The purpose of an error-model is to act as a filter to
revert the mistakes the user typing the erroneous word-form has made. The
simplest and most traditional model for making such corrections is the
Levenshtein-Damerau edit distance algorithm, attributed initially to
<cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib27" title="Binary codes capable of correcting deletions, insertions, and reversals" class="ltx_ref">6</a>]</cite> and especially in the context of spell-checking to
<cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib25" title="A technique for computer detection and correction of spelling errors" class="ltx_ref">3</a>]</cite>. The Levenshtein-Damerau edit distance assumes that
spelling errors are one of insertion, deletion or changing of a single
character to another, or swapping two adjacent characters, which models well
the spelling errors caused by an accidental slip of finger on a keyboard. It
was originally discovered that for most languages and spelling errors, this
simple method already covers 80 % of all spelling errors <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib25" title="A technique for computer detection and correction of spelling errors" class="ltx_ref">3</a>]</cite>.
This model is also language-independent, ignoring the differences in character
repertoires of a given language. Various other error models have also been
developed, ranging from confusion sets to phonemic folding <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib18" title="Techniques for automatically correcting words in text" class="ltx_ref">5</a>]</cite>.</p>
</div>
<div id="S1.p4" class="ltx_para">
<p class="ltx_p">In this paper, we evaluate the use of context for further fine-tuning of the
correction suggestions. The context is still not commonly used in
spell-checkers. According to <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib18" title="Techniques for automatically correcting words in text" class="ltx_ref">5</a>]</cite> it was lacking in the majority
of spell-checkers and while the situation may have improved slightly for some
commercial office suite products, the main spell-checkers for open source
environments are still primarily context-ignorant, such as
hunspell<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">1</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">1</sup><span class="ltx_ERROR undefined">\url</span>http://hunspell.sf.net</span></span></span> which is widely used in the
open source world. For English, the surface word-form trigrams model has been
demonstrated to be reasonably efficient both for non-word cases
<cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib28" title="Probability scoring for spelling correction" class="ltx_ref">2</a>]</cite> and for for real-word cases<cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib43" title="Context based spelling correction" class="ltx_ref">7</a>]</cite>. As an
additional way to improve the set of suggestions, we propose to use
morphosyntactically relevant analyses in context. In this article, we evaluate
a model with a statistical morphological tagger <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib69" title="Combining statistical models for pos tagging using finite-state calculus" class="ltx_ref">16</a>]</cite>. The
resulting system is in effect similar as described in <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib1" title="Contextual spelling correction." class="ltx_ref">11</a>]</cite> for
Spanish<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">2</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">2</sup>We are grateful for the anonymous reviewer on bringing this
previous work on same methods and similar systems to our knowledge.</span></span></span></p>
</div>
<div id="S1.p5" class="ltx_para">
<p class="ltx_p">The system described is fully built on freely available tools and data,
available for download and use from
<span class="ltx_ERROR undefined">\url</span>http://hfst.svn.sourceforge.net/viewvc/hfst/trunk/cicling-2011-contextspell/.
The only exception to this is the training data for Finnish, since there is no
available morphological training data for Finnish as of yet, the download does
not contain the source material for training but only the trigram models
compiled into binary format automata.</p>
</div>
<div id="S1.p6" class="ltx_para">
<p class="ltx_p">Furthermore, we test the context-based spelling methods using both English and
Finnish language materials to ensure the applicability of the method for
morphologically different languages. The reason for doing this is two-fold;
firstly the fact that English has rather low morphological productivity may
make it behave statistically differently from other languages. On the other
hand, English has the largest amount of freely available text corpora. For
other languages, the availability of free corpora, especially annotated
material, is often seen as a problem.</p>
</div>
<div id="S1.p7" class="ltx_para">
<p class="ltx_p">The article is laid out as follows: In Section <a href="#S2" title="2 Methods ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>, we outline
the implementation of a finite-state context-aware spell-checker and describe
the statistical methods used. In Section <a href="#S3" title="3 Material ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">3</span></a>, we introduce the
corpora and dictionaries used for spell-checking and training material as well
as the corpora used for obtaining the spelling errors with context. In
Section <a href="#S4" title="4 Tests and Evaluation ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">4</span></a>, we show how the created spelling correctors
improve the results and explain the errors left. In
Section <a href="#S5" title="5 Future Work and Discussion ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">5</span></a>, we compare our work with other current systems
and enumerate possible improvements for both.</p>
</div>
</section>
<section id="S2" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">2 </span>Methods</h2>
<div id="S2.p1" class="ltx_para">
<p class="ltx_p">The spelling correction in this article is performed in several phases:
assuming misspelled word <em class="ltx_emph">cta</em> for <em class="ltx_emph">cat</em>, we first apply the error
model to the already known incorrect string <em class="ltx_emph">cta</em> to produce candidates
for probable mistypings. For this purpose we use the Damerau-Levenshtein
edit-distance algorithm in finite-state form. When applied to <em class="ltx_emph">cta</em> we get
all strings with one or two typing mistakes, i.e. <em class="ltx_emph">ata</em>, <em class="ltx_emph">bta</em>,
…, <em class="ltx_emph">acta</em>, <em class="ltx_emph">bcta</em>, …, <em class="ltx_emph">ta</em>, <em class="ltx_emph">ca</em>, …,
<em class="ltx_emph">tca</em>, and the correct <em class="ltx_emph">cat</em>. This set of strings is simultaneously
matched against the language model, which will produce a set of corrections,
such as <em class="ltx_emph">cat</em>, <em class="ltx_emph">act</em> or <em class="ltx_emph">car</em>. Since both the error-model and
the language model contain information on likelihoods of errors and words
respectively, the resulting list will be sorted according to a combination of
the edit distance measure and the probability of the word in a reference
corpus. The rankings based on edit distance alone and the edit distance
combined with word probabilities form our two baseline models.</p>
</div>
<div id="S2.p2" class="ltx_para">
<p class="ltx_p">The context-based models we introduce here use the suggestion list gained from
a contextless spelling-checker and the context of the words as input to rerank
suggestions based on N-gram models. Each of the suggestions is tried against
the N-gram models, and the ones with higher likelihoods will be lifted. For
example when correcting the misspelling of ‘an’ as ‘anx’ in the sentence “this
is anx example sentence”, as shown in the Table <a href="#S2.T1" title="Table 1 ‣ 2 Methods ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>, we have the
surface trigrams {this, is, _}, {is, _, example}, {_, example,
sentence}, and corresponding analysis trigrams {DET, VVBZ, _}, {VVBZ, _,
NN}, {_, NN, NN}. The suggestions for anx at edit distance one include
‘ax’, ‘an’ (one deletion), ‘ant’, ‘and’, ‘any’ (one change) and so on. To rank
the possible suggestions, we substitute <math id="S2.p2.m1" class="ltx_Math" alttext="s_{3}" display="inline"><msub><mi>s</mi><mn>3</mn></msub></math> with the suggestions, and
calculate the likelihood of their analyses.
</p>
</div>
<figure id="S2.T1" class="ltx_table">
<figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_table">Table 1: </span>Example trigram combinations</figcaption>
<table class="ltx_tabular ltx_centering ltx_align_middle">
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_t">this<math id="S2.T1.m1" class="ltx_Math" alttext="{}_{s_{1}}" display="inline"><msub><mi></mi><msub><mi>s</mi><mn>1</mn></msub></msub></math>
</td>
<td class="ltx_td ltx_align_left ltx_border_t">is<math id="S2.T1.m2" class="ltx_Math" alttext="{}_{s_{2}}" display="inline"><msub><mi></mi><msub><mi>s</mi><mn>2</mn></msub></msub></math>
</td>
<td class="ltx_td ltx_align_center ltx_border_t">_<math id="S2.T1.m3" class="ltx_Math" alttext="{}_{s_{3}}" display="inline"><msub><mi></mi><msub><mi>s</mi><mn>3</mn></msub></msub></math>
</td>
<td class="ltx_td ltx_align_right ltx_border_t">example<math id="S2.T1.m4" class="ltx_Math" alttext="{}_{s_{4}}" display="inline"><msub><mi></mi><msub><mi>s</mi><mn>4</mn></msub></msub></math>
</td>
<td class="ltx_td ltx_align_right ltx_border_t">sentence<math id="S2.T1.m5" class="ltx_Math" alttext="{}_{s_{5}}" display="inline"><msub><mi></mi><msub><mi>s</mi><mn>5</mn></msub></msub></math>
</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_b">DET<math id="S2.T1.m6" class="ltx_Math" alttext="{}_{a_{1}}" display="inline"><msub><mi></mi><msub><mi>a</mi><mn>1</mn></msub></msub></math>
</td>
<td class="ltx_td ltx_align_left ltx_border_b">VVBZ<math id="S2.T1.m7" class="ltx_Math" alttext="{}_{a_{2}}" display="inline"><msub><mi></mi><msub><mi>a</mi><mn>2</mn></msub></msub></math>
</td>
<td class="ltx_td ltx_align_center ltx_border_b">_ <math id="S2.T1.m8" class="ltx_Math" alttext="{}_{a_{3}}" display="inline"><msub><mi></mi><msub><mi>a</mi><mn>3</mn></msub></msub></math>
</td>
<td class="ltx_td ltx_align_right ltx_border_b">NN<math id="S2.T1.m9" class="ltx_Math" alttext="{}_{a_{4}}" display="inline"><msub><mi></mi><msub><mi>a</mi><mn>4</mn></msub></msub></math>
</td>
<td class="ltx_td ltx_align_right ltx_border_b">NN<math id="S2.T1.m10" class="ltx_Math" alttext="{}_{a_{5}}" display="inline"><msub><mi></mi><msub><mi>a</mi><mn>5</mn></msub></msub></math>
</td>
</tr>
</tbody>
</table>
</figure>
<section id="S2.SS1" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.1 </span>Weighted Finite-State Interpretation of the Method</h3>
<div id="S2.SS1.p1" class="ltx_para">
<p class="ltx_p">In this article we use a finite-state formulation of
spell-checking. We assume the standard notation for finite-state
algebra and define the language model as a weighted finite-state
automaton assigning a weight to each correctly spelled word-form of a
language, and an error model automaton mapping a misspelled string to
a set of corrected strings and their weights. The probabilistic
interpretation of the components is such that the weighted fsa as a
language model assigns weight <math id="S2.SS1.p1.m1" class="ltx_Math" alttext="w(s)" display="inline"><mrow><mi>w</mi><mo></mo><mrow><mo stretchy="false">(</mo><mi>s</mi><mo stretchy="false">)</mo></mrow></mrow></math> to word <math id="S2.SS1.p1.m2" class="ltx_Math" alttext="s" display="inline"><mi>s</mi></math> corresponding to the
probability <math id="S2.SS1.p1.m3" class="ltx_Math" alttext="P(s)" display="inline"><mrow><mi>P</mi><mo></mo><mrow><mo stretchy="false">(</mo><mi>s</mi><mo stretchy="false">)</mo></mrow></mrow></math> for the word to be a correct word in the
language. The error model assigns weight <math id="S2.SS1.p1.m4" class="ltx_Math" alttext="w(s:r)" display="inline"><mrow><mi>w</mi><mrow><mo stretchy="false">(</mo><mi>s</mi><mo>:</mo><mi>r</mi><mo stretchy="false">)</mo></mrow></mrow></math> to string pair <math id="S2.SS1.p1.m5" class="ltx_Math" alttext="s,r" display="inline"><mrow><mi>s</mi><mo>,</mo><mi>r</mi></mrow></math> corresponding to the probability <math id="S2.SS1.p1.m6" class="ltx_Math" alttext="P(s|r)" display="inline"><mrow><mi>P</mi><mrow><mo stretchy="false">(</mo><mi>s</mi><mo stretchy="false">|</mo><mi>r</mi><mo stretchy="false">)</mo></mrow></mrow></math> of a user writing word r
when intending to write the word <math id="S2.SS1.p1.m7" class="ltx_Math" alttext="s" display="inline"><mi>s</mi></math>, and the context model assigns
weight <math id="S2.SS1.p1.m8" class="ltx_Math" alttext="w(s_{3}a_{3})" display="inline"><mrow><mi>w</mi><mo></mo><mrow><mo stretchy="false">(</mo><mrow><msub><mi>s</mi><mn>3</mn></msub><mo></mo><msub><mi>a</mi><mn>3</mn></msub></mrow><mo stretchy="false">)</mo></mrow></mrow></math> to word <math id="S2.SS1.p1.m9" class="ltx_Math" alttext="s_{3}" display="inline"><msub><mi>s</mi><mn>3</mn></msub></math> with associated POS tagging <math id="S2.SS1.p1.m10" class="ltx_Math" alttext="a_{3}" display="inline"><msub><mi>a</mi><mn>3</mn></msub></math>
corresponding to the standard HMM estimate <math id="S2.SS1.p1.m11" class="ltx_Math" alttext="P(a_{3}s_{3})" display="inline"><mrow><mi>P</mi><mo></mo><mrow><mo stretchy="false">(</mo><mrow><msub><mi>a</mi><mn>3</mn></msub><mo></mo><msub><mi>s</mi><mn>3</mn></msub></mrow><mo stretchy="false">)</mo></mrow></mrow></math> of the analysis being in
a 3-gram context given by equation (<a href="#S2.E1" title="(1) ‣ 2.1 Weighted Finite-State Interpretation of the Method ‣ 2 Methods ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>).</p>
<table id="S2.E1" class="ltx_equation ltx_eqn_table">
<tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_eqn_cell ltx_align_center"><math id="S2.E1.m1" class="ltx_Math" alttext="P(a_{3}s_{3})=\prod_{i=3}^{5}P(s_{i}|a_{i})P(a_{i}|a_{i-2},\ a_{i-1})" display="block"><mrow><mi>P</mi><mrow><mo stretchy="false">(</mo><msub><mi>a</mi><mn>3</mn></msub><msub><mi>s</mi><mn>3</mn></msub><mo stretchy="false">)</mo></mrow><mo>=</mo><munderover><mo largeop="true" movablelimits="false" symmetric="true">∏</mo><mrow><mi>i</mi><mo>=</mo><mn>3</mn></mrow><mn>5</mn></munderover><mi>P</mi><mrow><mo stretchy="false">(</mo><msub><mi>s</mi><mi>i</mi></msub><mo stretchy="false">|</mo><msub><mi>a</mi><mi>i</mi></msub><mo stretchy="false">)</mo></mrow><mi>P</mi><mrow><mo stretchy="false">(</mo><msub><mi>a</mi><mi>i</mi></msub><mo stretchy="false">|</mo><msub><mi>a</mi><mrow><mi>i</mi><mo>-</mo><mn>2</mn></mrow></msub><mo rspace="7.5pt">,</mo><msub><mi>a</mi><mrow><mi>i</mi><mo>-</mo><mn>1</mn></mrow></msub><mo stretchy="false">)</mo></mrow></mrow></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td rowspan="1" class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right"><span class="ltx_tag ltx_tag_equation ltx_align_right">(1)</span></td>
</tr>
</table>
</div>
<div id="S2.SS1.p2" class="ltx_para">
<p class="ltx_p">In a weighted finite-state system, the probabilistic data needs to be converted
to the algebra supported by the finite-state weight structure. In this case we
use the tropical semi-ring by transforming the frequencies into penalty weights
with the formula <math id="S2.SS1.p2.m1" class="ltx_Math" alttext="-\log\frac{f}{CS}" display="inline"><mrow><mo>-</mo><mrow><mi>log</mi><mo></mo><mfrac><mi>f</mi><mrow><mi>C</mi><mo></mo><mi>S</mi></mrow></mfrac></mrow></mrow></math>, where <math id="S2.SS1.p2.m2" class="ltx_Math" alttext="f" display="inline"><mi>f</mi></math> is the frequency and <math id="S2.SS1.p2.m3" class="ltx_Math" alttext="CS" display="inline"><mrow><mi>C</mi><mo></mo><mi>S</mi></mrow></math> the
corpus size in number of tokens. If the language model allows for words that
are not in the dictionary, a maximal weight is assigned to the unseen word
forms that may be in the language model but not in the training corpus, i.e.
any unseen word has a penalty weight of <math id="S2.SS1.p2.m4" class="ltx_Math" alttext="-\log\frac{1}{CS}" display="inline"><mrow><mo>-</mo><mrow><mi>log</mi><mo></mo><mfrac><mn>1</mn><mrow><mi>C</mi><mo></mo><mi>S</mi></mrow></mfrac></mrow></mrow></math>.</p>
</div>
<div id="S2.SS1.p3" class="ltx_para">
<p class="ltx_p">The spelling corrections suggested by these unigram lexicon-based
spell-checkers are initially generated by composing an edit-distance
automaton <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib19" title="Typographical nearest-neighbor search in a finite-state lexicon and its application to spelling correction" class="ltx_ref">15</a>]</cite> with an error weight corresponding to the
probability of the error estimated in a corpus, i.e. <math id="S2.SS1.p3.m1" class="ltx_Math" alttext="-\log\frac{f_{F}}{CS+1}" display="inline"><mrow><mo>-</mo><mrow><mi>log</mi><mo></mo><mfrac><msub><mi>f</mi><mi>F</mi></msub><mrow><mrow><mi>C</mi><mo></mo><mi>S</mi></mrow><mo>+</mo><mn>1</mn></mrow></mfrac></mrow></mrow></math>,
where <math id="S2.SS1.p3.m2" class="ltx_Math" alttext="f_{F}" display="inline"><msub><mi>f</mi><mi>F</mi></msub></math> is the frequency of the misspelling in a corpus. This weight is
attached to the edit distance type error. In practice, this typically still
means that the corrections are initially ordered primarily by the edit distance
of the correction, and secondarily by the unigram frequency of the word-form in
the reference corpus. This order is implicitly encoded in the weighted paths
of the resulting automaton; to list the corrections we use the n-best paths
algorithm <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib60" title="An efficient algorithm for the n-best-strings problem" class="ltx_ref">9</a>]</cite>. This ordering is also used as our second baseline.</p>
</div>
<div id="S2.SS1.p4" class="ltx_para">
<p class="ltx_p">For a context-based reordering of the corrections, we use the POS tagging
probabilities for the given suggestions. The implementation of the analysis
N-gram probability estimation is similar to the one described in
<cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib69" title="Combining statistical models for pos tagging using finite-state calculus" class="ltx_ref">16</a>]</cite> with the following adaptations for the spelling
correction. For the suggestion which gives the highest ranking, the most likely
analysis is selected. The N-gram probability is estimated separately for each
spelling suggestion and then combined with the baseline probability given by
the unigram probability and the edit distance weight. The ideal scaling for the
weights of unigram probabilities, i.e. edit distance probabilities with
respect to N-gram probabilities, can be acquired by e.g.g tuning the scaling
parameter on an automatically generated spelling error corpus.</p>
</div>
<div id="S2.SS1.p5" class="ltx_para">
<p class="ltx_p">The resulting finite-state system consists of three sets of automata, i.e. the
dictionary for spell-checking, the error-model as described in
<cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib12" title="Finite-state spell-checking with weighted language and error models" class="ltx_ref">13</a>]</cite>, and the new N-gram model automata. The automata sizes
are given in Table <a href="#S2.T2" title="Table 2 ‣ 2.1 Weighted Finite-State Interpretation of the Method ‣ 2 Methods ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a> for reference. The sizes also give an
estimate of the memory usage of the spell-checking system, although the actual
memory-usage during correction will rise depending on the actual extent of the
search space during the correction phase.</p>
</div>
<figure id="S2.T2" class="ltx_table">
<figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_table">Table 2: </span>Automata sizes.</figcaption>
<table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle">
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">Automaton</th>
<td class="ltx_td ltx_align_right">States</td>
<td class="ltx_td ltx_align_right">Transitions</td>
<td class="ltx_td ltx_align_right">Bytes</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_t" colspan="4"><span class="ltx_text ltx_font_bold">English</span></th>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t">Dictionary</th>
<td class="ltx_td ltx_align_right ltx_border_t">25,330</td>
<td class="ltx_td ltx_align_right ltx_border_t">42,448</td>
<td class="ltx_td ltx_align_right ltx_border_t">1.2 MiB</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">Error model</th>
<td class="ltx_td ltx_align_right">1,303</td>
<td class="ltx_td ltx_align_right">492,232</td>
<td class="ltx_td ltx_align_right">5.9 MiB</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">N-gram lexicon</th>
<td class="ltx_td ltx_align_right">363,053</td>
<td class="ltx_td ltx_align_right">1,253,315</td>
<td class="ltx_td ltx_align_right">42 MiB</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">N-gram sequences</th>
<td class="ltx_td ltx_align_right">46,517</td>
<td class="ltx_td ltx_align_right">200,168</td>
<td class="ltx_td ltx_align_right">4.2 MiB</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_t" colspan="4"><span class="ltx_text ltx_font_bold">Finnish</span></th>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t">Dictionary</th>
<td class="ltx_td ltx_align_right ltx_border_t">179,035</td>
<td class="ltx_td ltx_align_right ltx_border_t">395,032</td>
<td class="ltx_td ltx_align_right ltx_border_t">16 MiB</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">Error model</th>
<td class="ltx_td ltx_align_right">1,863</td>
<td class="ltx_td ltx_align_right">983,227</td>
<td class="ltx_td ltx_align_right">12 MiB</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">N-gram lexicon</th>
<td class="ltx_td ltx_align_right">70,665</td>
<td class="ltx_td ltx_align_right">263,298</td>
<td class="ltx_td ltx_align_right">8.0 MiB</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_b">N-gram sequences</th>
<td class="ltx_td ltx_align_right ltx_border_b">3,325</td>
<td class="ltx_td ltx_align_right ltx_border_b">22,418</td>
<td class="ltx_td ltx_align_right ltx_border_b">430 KiB</td>
</tr>
</tbody>
</table>
</figure>
</section>
<section id="S2.SS2" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.2 </span>English-Specific Finite-State Weighting Methods</h3>
<div id="S2.SS2.p1" class="ltx_para">
<p class="ltx_p">The language model for English was created as described in
<cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib44" title="How to write a spelling corrector" class="ltx_ref">10</a>]</cite><span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">3</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">3</sup>The finite-state formulation of this is informally
described in
<span class="ltx_ERROR undefined">\url</span>http://blogs.helsinki.fi/tapirine/2011/07/21/how-to-write-an-hfst-spelling-corrector/</span></span></span>.
It consists of the word-forms and their probabilities in the corpora. The edit
distance is composed of the standard English alphabet with an estimated error
likelihood of 1 in 1000 words. Similarly for the English N-gram material, the
initial analyses found in the WSJ corpus were used in the finite-state tagger
as such. The scaling factor between the dictionary probability model and the
edit distance model was acquired by estimating the optimal multipliers using
the automatic misspellings and corrections of a Project Gutenberg
Ebook<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">4</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">4</sup><span class="ltx_ERROR undefined">\url</span>http://www.gutenberg.org/cache/epub/11/pg11.txt</span></span></span>
<em class="ltx_emph">Alice’s Adventures in Wonderland</em>. In here the estimation simply means
trying out factors until results are stable and picking the best one.</p>
</div>
</section>
<section id="S2.SS3" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.3 </span>Finnish-Specific Finite-State Weighting Methods</h3>
<div id="S2.SS3.p1" class="ltx_para">
<p class="ltx_p">The Finnish language model was based on a readily-available morphological
weighted analyser of Finnish language <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib9" title="Modularisation of finnish finite-state language description—towards wide collaboration in open source development of a morphological analyser" class="ltx_ref">14</a>]</cite>. We further
modified the automaton to penalize suggestions with newly created compounds and
derivations by adding a weight greater than the maximum to such suggestions,
i.e. <math id="S2.SS3.p1.m1" class="ltx_Math" alttext="-A\log\frac{1}{CS+1}" display="inline"><mrow><mo>-</mo><mrow><mi>A</mi><mo></mo><mrow><mi>log</mi><mo></mo><mfrac><mn>1</mn><mrow><mrow><mi>C</mi><mo></mo><mi>S</mi></mrow><mo>+</mo><mn>1</mn></mrow></mfrac></mrow></mrow></mrow></math>, where <math id="S2.SS3.p1.m2" class="ltx_Math" alttext="A" display="inline"><mi>A</mi></math> is the scaling factor acquired from the
training material. This has nearly the same effect as using a separate
dictionary for suggestions that excludes the heavily weighted forms without
requiring the extra space. Also for Finnish, a scaling factor was estimated by
using automatic misspellings and corrections of a Project Gutenberg
Ebook<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">5</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">5</sup><span class="ltx_ERROR undefined">\url</span>http://www.gutenberg.org/cache/epub/10863/pg10863.txt</span></span></span>
<em class="ltx_emph">Juha</em>.</p>
</div>
<div id="S2.SS3.p2" class="ltx_para">
<p class="ltx_p">In the initial Finnish tagger, there was a relatively large tagset, all of
which did not contain information necessary for the task of
spell-checking, such as discourse particles, which are relatively
context-agnostic <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib50" title="Iso suomen kielioppi" class="ltx_ref">4</a>]</cite>, so we opted to simplify the tagging in these
cases. Furthermore, the tagger used for training produced heuristic readings for
unrecognized word-forms, which we also removed. Finally, we needed to add
some extra penalties to the word forms unknown to the dictionary in the
N-gram model, since this phenomenon was more frequent and diverse for Finnish
than English. The extra penalties were acquired by iterative testing on the
correction material using generated errors.</p>
</div>
</section>
</section>
<section id="S3" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">3 </span>Material</h2>
<div id="S3.p1" class="ltx_para">
<p class="ltx_p">To train the spell-checker lexicons, word-form probabilities can be acquired
from arbitrary running text. By using unigram frequencies, we can assign
all word-forms some initial probabilities in isolation, i.e. with no spell-checking context.
The unigram-trained models we used were acquired from existing
spell-checker systems <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib44" title="How to write a spelling corrector" class="ltx_ref">10</a>, <a href="#bib.bib12" title="Finite-state spell-checking with weighted language and error models" class="ltx_ref">13</a>]</cite>, but we briefly
describe the used corpora here as well.</p>
</div>
<div id="S3.p2" class="ltx_para">
<p class="ltx_p">To train the various N-gram models, corpora are required. For the surface-form
training material, it is sufficient to capture running N-grams in the text.
For training the statistical tagger with annotations, we also require
disambiguated readings. Ideally of course this means hand-annotated
tree banks or similar gold standard corpora.</p>
</div>
<div id="S3.p3" class="ltx_para">
<p class="ltx_p">The corpora used are summarized in Table <a href="#S3.T3" title="Table 3 ‣ 3 Material ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">3</span></a>. The sizes are
provided to make it possible to reconstruct the systems. In practice, they are the newest
available versions of the respective corpora at the time of testing. In the
table, the first row is the training material used for the finite-state
lexicon, i.e. the extracted surface word-forms without the analyses for unigram
training. The second row is for the analyzed and disambiguated material for the
N-gram based taggers for suggestion improvement. The third line is the corpora
of spelling errors used only for the evaluation of the systems. As we can see
from the figures of English compared with Finnish, there is a significant
difference in freely available corpora such as Wikipedia. When going further to
lesser resourced languages, the number will drop enough to make such statistical
approaches less useful, e.g. Northern Sámi in <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib12" title="Finite-state spell-checking with weighted language and error models" class="ltx_ref">13</a>]</cite>.</p>
</div>
<figure id="S3.T3" class="ltx_table">
<figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_table">Table 3: </span>Sizes of training and evaluation corpora.
</figcaption>
<table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle">
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<th class="ltx_td ltx_th ltx_th_row ltx_border_t"></th>
<td class="ltx_td ltx_align_center ltx_border_t">Sentences</td>
<td class="ltx_td ltx_align_center ltx_border_t">Tokens</td>
<td class="ltx_td ltx_align_center ltx_border_t">Word-forms</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_t" colspan="4"><span class="ltx_text ltx_font_bold">English</span></th>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t">Unigrams</th>
<td class="ltx_td ltx_border_t"></td>
<td class="ltx_td ltx_align_center ltx_border_t">2,110,728,338</td>
<td class="ltx_td ltx_align_center ltx_border_t">128,457</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">N-grams</th>
<td class="ltx_td ltx_align_center">42,164</td>
<td class="ltx_td ltx_align_center">969,905</td>
<td class="ltx_td ltx_align_center">39,690</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">Errors</th>
<td class="ltx_td ltx_align_center">85</td>
<td class="ltx_td ltx_align_center">606</td>
<td class="ltx_td ltx_align_center">217</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_t" colspan="4"><span class="ltx_text ltx_font_bold">Finnish</span></th>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t">Unigrams</th>
<td class="ltx_td ltx_border_t"></td>
<td class="ltx_td ltx_align_center ltx_border_t">17,479,297</td>
<td class="ltx_td ltx_align_center ltx_border_t">968,996</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">N-grams</th>
<td class="ltx_td ltx_align_center">98,699</td>
<td class="ltx_td ltx_align_center">1,027,514</td>
<td class="ltx_td ltx_align_center">144,658</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_b">Errors</th>
<td class="ltx_td ltx_align_center ltx_border_b">333</td>
<td class="ltx_td ltx_align_center ltx_border_b">4,177</td>
<td class="ltx_td ltx_align_center ltx_border_b">2,762</td>
</tr>
</tbody>
</table>
</figure>
<section id="S3.SS1" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">3.1 </span>English corpora</h3>
<div id="S3.SS1.p1" class="ltx_para">
<p class="ltx_p">The English dictionary is based on a frequency weighted word-form list of
the English language as proposed in <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib44" title="How to write a spelling corrector" class="ltx_ref">10</a>]</cite>. The word-forms were
collected from the English Wiktionary<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">6</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">6</sup><span class="ltx_ERROR undefined">\url</span>http://en.wiktionary.org</span></span></span>,
the English EBooks from the project
Gutenberg<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">7</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">7</sup><span class="ltx_ERROR undefined">\url</span>http://www.gutenberg.org/browse/languages/en</span></span></span> and the
British National
Corpus<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">8</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">8</sup><span class="ltx_ERROR undefined">\url</span>http://www.kilgarriff.co.uk/bnc-readme.html</span></span></span>. This
frequency weighted word-list is in effect used as a unigram lexicon for spell-checking.</p>
</div>
<div id="S3.SS1.p2" class="ltx_para">
<p class="ltx_p">To train an English morphosyntactic tagger, we use the WSJ corpus. In
this corpus each word is annotated by a single tag that encodes some
morphosyntactic information, such as part-of-speech and inflectional
form. The total number of tags in this corpus is <math id="S3.SS1.p2.m1" class="ltx_Math" alttext="77" display="inline"><mn>77</mn></math>.</p>
</div>
<div id="S3.SS1.p3" class="ltx_para">
<p class="ltx_p">The spelling errors of English were acquired by extracting the ones with
context from the Birkbeck error
corpus<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">9</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">9</sup><span class="ltx_ERROR undefined">\url</span>http://ota.oucs.ox.ac.uk/headers/0643.xml</span></span></span>. In this
corpus, the errors are from a variety of sources, including errors made by
children and language-learners. For the purpose of this experiment we picked
the subset of errors which had context and also removed the cases of word
joining and splitting to simplify the implementation of parsing and suggestion.
When interpreting results it should be noted that many of these English errors
are competence errors while the baseline algorithm used to model errors here is
for typing errors.</p>
</div>
</section>
<section id="S3.SS2" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">3.2 </span>Finnish Corpora</h3>
<div id="S3.SS2.p1" class="ltx_para">
<p class="ltx_p">As the Finnish dictionary, we selected the freely available open source
finite-state implementation of a Finnish morphological
analyser<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">10</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">10</sup><span class="ltx_ERROR undefined">\url</span>http://home.gna.org/omorfi</span></span></span>. The analyser had the
frequency-weighted word-form list based on Finnish
Wikipedia<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">11</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">11</sup><span class="ltx_ERROR undefined">\url</span>http://download.wikipedia.org/fiwiki/</span></span></span> making it in
practice an extended unigram lexicon for Finnish. The Finnish morphological
analyser, however, is capable of infinite compounding and derivation, which
makes it a notably different approach to spell checking than the English finite
word-form list.</p>
</div>
<div id="S3.SS2.p2" class="ltx_para">
<p class="ltx_p">The Finnish morphosyntactic N-gram model was trained using a
morphosyntactically analyzed Finnish
Newspaper<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">12</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">12</sup><span class="ltx_ERROR undefined">\url</span>http://www.csc.fi/english/research/software/ftc</span></span></span>. In
this format, the annotation is based on a sequence of tags, encoding
part of speech and inflectional form. The total number of different
tag sequences for this annotation is 747.</p>
</div>
<div id="S3.SS2.p3" class="ltx_para">
<p class="ltx_p">For Finnish spelling errors, we ran the Finnish unigram spell-checker through
Wikipedia, europarl and a corpus of Finnish EBooks from the project
Gutenberg<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">13</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">13</sup><span class="ltx_ERROR undefined">\url</span>http://www.gutenberg.org/browse/languages/fi</span></span></span> to
acquire the non-word spelling errors, and picked at random the errors having
frequencies in range 1 to 8 instances; a majority of higher frequency non-words
were actually proper nouns or neologisms missing from the dictionary. Using all of
Wikipedia, europarl and Gutenberg provides a reasonable variety of both
contemporary and old texts in a wide range of styles.</p>
</div>
</section>
</section>
<section id="S4" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">4 </span>Tests and Evaluation</h2>
<div id="S4.p1" class="ltx_para">
<p class="ltx_p">The evaluation of the correction suggestion quality is described in
Table <a href="#S4.T4" title="Table 4 ‣ 4 Tests and Evaluation ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">4</span></a>. The Table <a href="#S4.T4" title="Table 4 ‣ 4 Tests and Evaluation ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">4</span></a> contains the
precision for the spelling errors. The precision is measured by ranked
suggestions. In the tables, we give the results separately for ranks 1—5, and
then for the accumulated ranks 1—10. The rows of the table represent
different combinations of the N-gram models. The first row is a baseline score
achieved by the weighted edit distance model alone, and the second is with
unigram-weighted dictionary over edit-distance 2. The last two columns are the
traditional word-form N-gram model and our POS tagger based extension to it.</p>
</div>
<figure id="S4.T4" class="ltx_table">
<figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_table">Table 4: </span>Precision of suggestion algorithms with real spelling errors.
</figcaption>
<table class="ltx_tabular ltx_centering ltx_align_middle">
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_t">Algorithm</td>
<td class="ltx_td ltx_align_right ltx_border_t">1</td>
<td class="ltx_td ltx_align_right ltx_border_t">2</td>
<td class="ltx_td ltx_align_right ltx_border_t">3</td>
<td class="ltx_td ltx_align_right ltx_border_t">4</td>
<td class="ltx_td ltx_align_right ltx_border_t">5</td>
<td class="ltx_td ltx_align_right ltx_border_t">1—10</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_center ltx_border_t" colspan="7"><span class="ltx_text ltx_font_bold">English</span></td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_t">Edit distance 2 (baseline)</td>
<td class="ltx_td ltx_align_right ltx_border_t">25.9 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">2.4 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">2.4 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">1.2 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">3.5 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">94.1 %</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_t">Edit distance 2 with Unigrams</td>
<td class="ltx_td ltx_align_right ltx_border_t">28.2 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">5.9 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">29.4 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">3.5%</td>
<td class="ltx_td ltx_align_right ltx_border_t">28.2 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">97.6 %</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_t">Edit distance 2 with Word N-grams</td>
<td class="ltx_td ltx_align_right ltx_border_t">29.4 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">10.6 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">34.1 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">5.9 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">14.1 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">97.7 %</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_t">Edit distance 2 with POS N-grams</td>
<td class="ltx_td ltx_align_right ltx_border_t">68.2 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">18.8 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">3.5 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">2.4 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">0.0 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">92.9 %</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_center ltx_border_t" colspan="7"><span class="ltx_text ltx_font_bold">Finnish</span></td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_t">Edit distance 2 (baseline)</td>
<td class="ltx_td ltx_align_right ltx_border_t">66.5 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">8.7 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">4.0 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">4.7 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">1.9 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">89.8 %</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_t">Edit distance 2 with Unigrams</td>
<td class="ltx_td ltx_align_right ltx_border_t">61.2 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">13.4 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">1.6 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">3.1 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">3.4 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">88.2 %</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_t">Edit distance 2 with Word N-grams</td>
<td class="ltx_td ltx_align_right ltx_border_t">65.0 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">14.4 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">3.8 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">3.1 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">2.2 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">90.6 %</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_b ltx_border_t">Edit distance 2 with POS N-grams</td>
<td class="ltx_td ltx_align_right ltx_border_b ltx_border_t">71.4 %</td>
<td class="ltx_td ltx_align_right ltx_border_b ltx_border_t">9.3 %</td>
<td class="ltx_td ltx_align_right ltx_border_b ltx_border_t">1.2 %</td>
<td class="ltx_td ltx_align_right ltx_border_b ltx_border_t">3.4 %</td>
<td class="ltx_td ltx_align_right ltx_border_b ltx_border_t">0.3 %</td>
<td class="ltx_td ltx_align_right ltx_border_b ltx_border_t">85.7 %</td>
</tr>
</tbody>
</table>
</figure>
<div id="S4.p2" class="ltx_para">
<p class="ltx_p">It would appear that POS N-grams will in both cases give a significant boost to
the results, whereas the word-form N-grams will merely give a slight increase to
the results. In the next subsections we further dissect the specific changes to
results the different approaches give.</p>
</div>
<section id="S4.SS1" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">4.1 </span>English Error-Analysis</h3>
<div id="S4.SS1.p1" class="ltx_para">
<p class="ltx_p">In <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib44" title="How to write a spelling corrector" class="ltx_ref">10</a>]</cite>, the authors identify errors that are not solved using
simple unigram weights, such as correcting <em class="ltx_emph">rember</em> to <em class="ltx_emph">remember</em>
instead of <em class="ltx_emph">member</em>. Here, our scaled POS N-gram context-model as well
as the simpler word N-gram model, which can bypass the
edit distance model weight will select the correct suggestion. However, when
correcting e.g. <em class="ltx_emph">ment</em> to <em class="ltx_emph">meant</em> in stead of <em class="ltx_emph">went</em> or
<em class="ltx_emph">met</em> the POS based context reranking gives no help as the POS stays the
same.</p>
</div>
</section>
<section id="S4.SS2" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">4.2 </span>Finnish Error-Analysis</h3>
<div id="S4.SS2.p1" class="ltx_para">
<p class="ltx_p">In Finnish results we can easily notice that variation within the first
position in the baseline results and reference system is more sporadic. This
can be traced back to the fuzz factor caused by a majority of probabilities
falling into the same category in our tests. The same edit-distance and unigram
probability leaves the decision to random factors irrelevant to this
experiment, such as alphabetical ordering that comes from data structures
backing up the program code. The N-gram based reorderings are the only methods
that can tiebreak the results here.</p>
</div>
<div id="S4.SS2.p2" class="ltx_para">
<p class="ltx_p">An obvious improvement for Finnish with POS N-grams comes from correcting
agreeing NP’s towards case agreement, such as <em class="ltx_emph">yhdistetstä</em> to
<em class="ltx_emph">yhdisteistä</em> (‘of compounds’ <span class="ltx_text ltx_font_smallcaps">Pl Ela</span>) instead of the statistically
more common <em class="ltx_emph">yhdisteestä</em> (‘of compound’ <span class="ltx_text ltx_font_smallcaps">Sg Ela</span>). However, as
with English, the POS information does fail to rerank cases where two equally
rare word-forms with the same POS occur at the same edit distance, which seems
to be common with participles, such as correcting <em class="ltx_emph">varustunut</em> to
<em class="ltx_emph">varautunut</em> in stead of <em class="ltx_emph">varastanut</em>.</p>
</div>
<div id="S4.SS2.p3" class="ltx_para">
<p class="ltx_p">Furthermore we note that the the discourse particles that were dropped from the
POS tagger’s analysis tag set in order to decrease the tag set size will cause
certain word forms in the dictionary to be incorrectly reranked, such as when
correcting the very common misspelling <em class="ltx_emph">muillekkin</em> into
<em class="ltx_emph">muillekokin</em> (‘for others as well?’ <span class="ltx_text ltx_font_smallcaps">Pl All Qst Kin</span>) instead of
the originally correct <em class="ltx_emph">muillekin</em> (‘for others as well’ <span class="ltx_text ltx_font_smallcaps">Pl All
Kin</span>), since the analyses <span class="ltx_text ltx_font_smallcaps">Qst</span> (for question enclitic) and <span class="ltx_text ltx_font_smallcaps">Kin</span>
(for additive enclitic) are both dropped from the POS analyses.</p>
</div>
</section>
<section id="S4.SS3" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">4.3 </span>Performance Evaluation</h3>
<div id="S4.SS3.p1" class="ltx_para">
<p class="ltx_p">We did not work on optimizing the N-gram analysis and selection,
but we found that the speed of the system is reasonable—even in its current
form, considering that the algorithm is applied only to incorrect words on
the user’s request. Table <a href="#S4.T5" title="Table 5 ‣ 4.3 Performance Evaluation ‣ 4 Tests and Evaluation ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">5</span></a> summarizes the average
speed of performing the experiments in Table <a href="#S4.T4" title="Table 4 ‣ 4 Tests and Evaluation ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">4</span></a>.</p>
</div>
<figure id="S4.T5" class="ltx_table">
<figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_table">Table 5: </span>The speed of ranking the errors.
</figcaption>
<table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle">
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t">Material</th>
<td class="ltx_td ltx_align_center ltx_border_t">English</td>
<td class="ltx_td ltx_align_center ltx_border_t">Finnish</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">Algorithm</th>
<td class="ltx_td"></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t">Unigram (baseline)</th>
<td class="ltx_td ltx_align_center ltx_border_t">10.0 s</td>
<td class="ltx_td ltx_align_center ltx_border_t">51.8 s</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_th ltx_th_row"></th>
<td class="ltx_td ltx_align_center">399.1 wps</td>
<td class="ltx_td ltx_align_center">6.2 wps</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t">POS N-grams</th>
<td class="ltx_td ltx_align_center ltx_border_t">377.4 s</td>
<td class="ltx_td ltx_align_center ltx_border_t">1616.2 s</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_th ltx_th_row ltx_border_b"></th>
<td class="ltx_td ltx_align_center ltx_border_b">10.6 wps</td>
<td class="ltx_td ltx_align_center ltx_border_b">0.14 wps</td>
</tr>
</tbody>
</table>
</figure>
<div id="S4.SS3.p2" class="ltx_para">
<p class="ltx_p">The performance penalty that is incurred on Finnish spell-checking but not so
much on English comes from the method of determining readings for words unknown
to the language model, i.e. from the guessing algorithm. The amount of words
unknown to the language model in Finnish was greater than for English due to
the training data sparseness and the morphological complexity of the language.</p>
</div>
</section>
</section>
<section id="S5" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">5 </span>Future Work and Discussion</h2>
<div id="S5.p1" class="ltx_para">
<p class="ltx_p">In this work we recreated the results of basic and context-based spelling
correction for English and implemented same system for Finnish. We have shown
that the POS based N-gram models are suitable for improving the spelling
corrections for both morphologically more complex languages such as Finnish and
for further improving languages with simpler morphologies like English. To
further verify the claim, the method may still need to be tested on a
typologically wider spectrum of languages.</p>
</div>
<div id="S5.p2" class="ltx_para">
<p class="ltx_p">In this article, we used readily available and hand-made error corpora to test
our error correction method. A similar method as the one we use for error
correction should be possible in error detection as well, especially when
detecting real-word errors <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib43" title="Context based spelling correction" class="ltx_ref">7</a>]</cite>. In future research, an obvious
development is to integrate the N-gram system as a part of a real spell-checker
system for both detection and correction of spelling errors, as is already done
for the unigram based spell checker demonstrated in <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib12" title="Finite-state spell-checking with weighted language and error models" class="ltx_ref">13</a>]</cite>.</p>
</div>
<div id="S5.p3" class="ltx_para">
<p class="ltx_p">The article discussed only the reranking over basic edit distance error
models, further research should include more careful statistical training for
the error model as well, such as one demonstrated in <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib42" title="An improved error model for noisy channel spelling correction" class="ltx_ref">1</a>]</cite>.</p>
</div>
</section>
<section id="S6" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">6 </span>Conclusion</h2>
<div id="S6.p1" class="ltx_para">
<p class="ltx_p">In this paper we have demonstrated the use of finite-state methods for trigram
based generation of spelling suggestions. We have shown that the basic
word-form trigram methods suggested for languages like English do not seem to
be as useful without modification for morphologically more complex languages
like Finnish. Instead a more elaborate N-gram scheme using POS n-grams is
successful for Finnish as well as English.</p>
</div>
</section>
<section id="Sx1" class="ltx_section">
<h2 class="ltx_title ltx_title_section">Acknowledgements</h2>
<div id="Sx1.p1" class="ltx_para">
<p class="ltx_p">We are grateful to Sam Hardwick for making the spell checking software
available and the HFST research group for fruitful discussions. We also thank
the anonymous reviewers for useful suggestions and pointers.
</p>
</div>
</section>
<section id="bib" class="ltx_bibliography">
<h2 class="ltx_title ltx_title_bibliography">References</h2>
<ul id="L1" class="ltx_biblist">
<li id="bib.bib42" class="ltx_bibitem ltx_bib_inproceedings">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[1]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">E. Brill and R. C. Moore</span><span class="ltx_text ltx_bib_year"> (2000)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">An improved error model for noisy channel spelling correction</span>.
</span>
<span class="ltx_bibblock">In <span class="ltx_text ltx_bib_inbook">ACL ’00: Proceedings of the 38th Annual Meeting on Association for Computational Linguistics</span>,
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_place">Morristown, NJ, USA</span>, <span class="ltx_text ltx_bib_pages"> pp. 286–293</span>.
</span>
<span class="ltx_bibblock">External Links: <span class="ltx_text ltx_bib_links"><a href="http://dx.doi.org/http://dx.doi.org/10.3115/1075218.1075255" title="" class="ltx_ref doi ltx_bib_external">Document</a></span>
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S5.p3" title="5 Future Work and Discussion ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">5</span></a>.
</span>
</li>
<li id="bib.bib28" class="ltx_bibitem ltx_bib_article">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[2]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">K. Church and W. Gale</span><span class="ltx_text ltx_bib_year"> (1991)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Probability scoring for spelling correction</span>.
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_journal">Statistics and Computing</span>, <span class="ltx_text ltx_bib_pages"> pp. 93–103</span>.
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S1.p4" title="1 Introduction ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>.
</span>
</li>
<li id="bib.bib25" class="ltx_bibitem ltx_bib_article">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[3]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">F. J. Damerau</span><span class="ltx_text ltx_bib_year"> (1964)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">A technique for computer detection and correction of spelling errors</span>.
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_journal">Commun. ACM</span> (<span class="ltx_text ltx_bib_number">7</span>).
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S1.p1" title="1 Introduction ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>,
<a href="#S1.p3" title="1 Introduction ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>.
</span>
</li>
<li id="bib.bib50" class="ltx_bibitem ltx_bib_misc">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[4]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">A. Hakulinen, M. Vilkuna, R. Korhonen, V. Koivisto, Heinonen and I. Alho</span><span class="ltx_text ltx_bib_year"> (2008)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Iso suomen kielioppi</span>.
</span>
<span class="ltx_bibblock"> <span class="ltx_text ltx_bib_publisher">Suomalaisen Kirjallisuuden Seura</span>.
</span>
<span class="ltx_bibblock">Note: <span class="ltx_text ltx_bib_note">referred on 31.12.2008, available from <span class="ltx_ERROR undefined">\url</span>http://kaino.kotus.fi/visk</span>
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S2.SS3.p2" title="2.3 Finnish-Specific Finite-State Weighting Methods ‣ 2 Methods ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2.3</span></a>.
</span>
</li>
<li id="bib.bib18" class="ltx_bibitem ltx_bib_article">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[5]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">K. Kukich</span><span class="ltx_text ltx_bib_year"> (1992)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Techniques for automatically correcting words in text</span>.
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_journal">ACM Comput. Surv.</span> <span class="ltx_text ltx_bib_volume">24</span> (<span class="ltx_text ltx_bib_number">4</span>), <span class="ltx_text ltx_bib_pages"> pp. 377–439</span>.
</span>
<span class="ltx_bibblock">External Links: <span class="ltx_text ltx_bib_links"><span class="ltx_text issn ltx_bib_external">ISSN 0360-0300</span>,
<a href="http://dx.doi.org/http://doi.acm.org/10.1145/146370.146380" title="" class="ltx_ref doi ltx_bib_external">Document</a></span>
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S1.p3" title="1 Introduction ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>,
<a href="#S1.p4" title="1 Introduction ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>.
</span>
</li>
<li id="bib.bib27" class="ltx_bibitem ltx_bib_article">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[6]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">V. I. Levenshtein</span><span class="ltx_text ltx_bib_year"> (1966)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Binary codes capable of correcting deletions, insertions, and reversals</span>.
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_journal">Soviet Physics—Doklady 10, 707â710. Translated from Doklady Akademii Nauk SSSR</span>, <span class="ltx_text ltx_bib_pages"> pp. 845–848</span>.
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S1.p3" title="1 Introduction ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>.
</span>
</li>
<li id="bib.bib43" class="ltx_bibitem ltx_bib_article">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[7]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">E. Mays, F. J. Damerau and R. L. Mercer</span><span class="ltx_text ltx_bib_year"> (1991)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Context based spelling correction</span>.
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_journal">Inf. Process. Manage.</span> <span class="ltx_text ltx_bib_volume">27</span> (<span class="ltx_text ltx_bib_number">5</span>), <span class="ltx_text ltx_bib_pages"> pp. 517–522</span>.
</span>
<span class="ltx_bibblock">External Links: <span class="ltx_text ltx_bib_links"><span class="ltx_text issn ltx_bib_external">ISSN 0306-4573</span>,
<a href="http://dx.doi.org/http://dx.doi.org/10.1016/0306-4573(91)90066-U" title="" class="ltx_ref doi ltx_bib_external">Document</a></span>
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S1.p4" title="1 Introduction ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>,
<a href="#S5.p2" title="5 Future Work and Discussion ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">5</span></a>.
</span>
</li>
<li id="bib.bib8" class="ltx_bibitem ltx_bib_article">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[8]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">R. Mitton</span><span class="ltx_text ltx_bib_year"> (2009)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Ordering the suggestions of a spellchecker without using context*</span>.
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_journal">Nat. Lang. Eng.</span> <span class="ltx_text ltx_bib_volume">15</span> (<span class="ltx_text ltx_bib_number">2</span>), <span class="ltx_text ltx_bib_pages"> pp. 173–192</span>.
</span>
<span class="ltx_bibblock">External Links: <span class="ltx_text ltx_bib_links"><span class="ltx_text issn ltx_bib_external">ISSN 1351-3249</span>,
<a href="http://dx.doi.org/http://dx.doi.org/10.1017/S1351324908004804" title="" class="ltx_ref doi ltx_bib_external">Document</a></span>
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S1.p2" title="1 Introduction ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>.
</span>
</li>
<li id="bib.bib60" class="ltx_bibitem ltx_bib_misc">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[9]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">M. Mohri and M. Riley</span><span class="ltx_text ltx_bib_year"> (2002)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">An efficient algorithm for the n-best-strings problem</span>.
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S2.SS1.p3" title="2.1 Weighted Finite-State Interpretation of the Method ‣ 2 Methods ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2.1</span></a>.
</span>
</li>
<li id="bib.bib44" class="ltx_bibitem ltx_bib_misc">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[10]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">P. Norvig</span><span class="ltx_text ltx_bib_year"> (2010)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">How to write a spelling corrector</span>.
</span>
<span class="ltx_bibblock">Note: <span class="ltx_text ltx_bib_note">Web Page, Visited February 28th 2010, Available <span class="ltx_ERROR undefined">\url</span>http://norvig.com/spell-correct.html</span>
</span>
<span class="ltx_bibblock">External Links: <span class="ltx_text ltx_bib_links"><a href="http://norvig.com/spell-correct.html" title="" class="ltx_ref ltx_bib_external">Link</a></span>
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S2.SS2.p1" title="2.2 English-Specific Finite-State Weighting Methods ‣ 2 Methods ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2.2</span></a>,
<a href="#S3.SS1.p1" title="3.1 English corpora ‣ 3 Material ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">3.1</span></a>,
<a href="#S3.p1" title="3 Material ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">3</span></a>,
<a href="#S4.SS1.p1" title="4.1 English Error-Analysis ‣ 4 Tests and Evaluation ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">4.1</span></a>.
</span>
</li>
<li id="bib.bib1" class="ltx_bibitem ltx_bib_inproceedings">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[11]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">J. Otero, J. Graña and M. Vilares</span><span class="ltx_text ltx_bib_year"> (2008-10-19)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Contextual spelling correction.</span>.
</span>
<span class="ltx_bibblock">In <span class="ltx_text ltx_bib_inbook">EUROCAST</span>, <span class="ltx_text ltx_bib_editor">R. Moreno-DÃaz, F. Pichler and A. Quesada-Arencibia (Eds.)</span>,
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_series">Lecture Notes in Computer Science</span>, Vol. <span class="ltx_text ltx_bib_volume">4739</span>, <span class="ltx_text ltx_bib_pages"> pp. 290–296</span>.
</span>
<span class="ltx_bibblock">External Links: <span class="ltx_text ltx_bib_links"><span class="ltx_text isbn ltx_bib_external">ISBN 978-3-540-75866-2</span>,
<a href="http://dblp.uni-trier.de/db/conf/eurocast/eurocast2007.html#OteroGV07" title="" class="ltx_ref ltx_bib_external">Link</a></span>
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S1.p4" title="1 Introduction ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>.
</span>
</li>
<li id="bib.bib35" class="ltx_bibitem ltx_bib_proceedings">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[12]</span>